Chapter 2. Fundamentals of Deep Networks

Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!

The Red Queen, Through the Looking Glass

Defining Deep Learning

In Chapter 1, we set up the foundations of machine learning and neural networks. In this chapter, we’ll build on those foundations to give you the core concepts of deep networks. This will help build your understanding of what is going on in different networks as we progress into the specific architectures in Chapter 3. Let’s begin by restating our definitions of both deep learning and deep networks.

What Is Deep Learning?

The facets that differentiate deep learning networks in general from “canonical” feed-forward multilayer networks are as follows:

  • More neurons than previous networks
  • More complex ways of connecting layers
  • “Cambrian explosion” of computing power to train
  • Automatic feature extraction

When we say “more neurons,” we mean that the neuron count has risen over the years to express more complex models. Layers also have evolved from each layer being fully connected in multilayer networks to locally connected patches of neurons between layers in Convolutional Neural Networks (CNNs) and recurrent connections to the same neuron in Recurrent Neural Networks (in addition to the connections from the previous layer).

More connections mean that our networks have more parameters to optimize, and this required the explosion in computing power that occurred over the past 20 years. All of these advances provided the foundation to build next-generation neural networks capable of extracting features for themselves in a more intelligent fashion. This allowed deep networks to model more complex problem spaces (e.g., advances in image recognition) than previously possible. As industry demands are ever changing and ever expanding, the capabilities of neural networks have had to charge forward. The Red Queen1 would have it no other way.

Defining deep networks

To further provide color to our definition of deep learning, here we define the four major architectures of deep networks:

  • Unsupervised Pretrained Networks
  • Convolutional Neural Networks
  • Recurrent Neural Networks
  • Recursive Neural Networks

There is continuous research in the domain of neural networks, but for the purposes of this lesson, we’ll focus on these four architectures, which have evolved over the past 20 years. Let’s take a quick look at some of the highlights, beginning with a brief history of feed-forward multilayer neural networks.

Evolutionary progress and resurgence

When we last left off in Chapter 1, neural networks had entered a “winter period” in the mid-1980s, when the promise of AI fell short of what the technology could deliver. As often happens when a promising technology falls into the Trough of Disillusionment (Figure 2-1), many researchers continued doing important work in the realm of neural networks.

Figure 2-1. Trough of Disillusionment (source: https://en.wikipedia.org/wiki/Hype_cycle)

One important development in neural networks was Yann LeCun’s work at AT&T Bell Labs on optical character recognition.3 His lab was focused on check image recognition for the financial services sector. Through this work, LeCun and his team developed the concept of the biologically inspired model of image recognition we know today as the CNN. This eventually led to the creation of the MNIST handwriting benchmark (which we cover in more detail later in the chapter) and a progression of record accuracy marks achieved by deep learning.

Better Labeled Data

Another contributing factor to the evolution and success of deep networks was the creation of better and larger labeled datasets such as MNIST and ImageNet.

Advances in modeling sequential data with recurrent neural networks appeared in the late 1980s and early 1990s through researchers such as Sepp Hochreiter. As time went on, the research community created better artificial neuron variants (e.g., the Long Short-Term Memory [LSTM] Memory Cell and the Memory Cell with Forget Gate) over the course of the late 1990s. The stage for a neural network resurgence was being set quietly in research labs around the world.

During the 2000s, researchers and industry began to progressively apply these advances in real-world products.

Self-driving cars in the 2005 DARPA Grand Challenge used many techniques beyond just deep learning, but the top teams (Stanford and Carnegie Mellon University) were able to take advantage of the big improvements in image processing.

Advances in Computer Vision

In 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton developed a “large, deep convolutional neural network” that won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge).

AlexNet4 was hailed as an advancement in computer vision, and some credit it specifically with kicking off the deep learning craze. However, it was largely a scaled-up (e.g., deeper and wider) variant of the CNNs from the 1990s. The recent advances in computer vision were driven less by new algorithms and more by better compute, data, and infrastructure.

Better image analysis allowed the planning systems in the cars to better choose paths through uncertain terrain and avoid obstacles more safely. Other advances in deep learning allowed models to more accurately translate and recognize audio data, driving value in the Google Translate and Amazon Echo lines of products. Most recently, we’ve seen another complex game fall to a machine at the master level when the AlphaGo system beat the 9-dan professional Go player Lee Sedol.

Big advances in what machine learning can accomplish are not always easy to see. Public recognition of these advances is often the culmination of many different lines of work, exhibited in high-profile demonstrations such as the DARPA Grand Challenge or Watson beating Ken Jennings on Jeopardy!. However, behind the scenes, the underpinnings for these advances change slowly but constantly. Just like the changing of the seasons, we don’t always notice these changes in our daily lives until they’ve crossed some threshold.

In the near future, we’ll continue to see deep learning being applied in unique and innovative ways. This application will be more of the latent intelligence variety (e.g., recommendations or voice recognition) coupled with pragmatic engineering to make them useful in everyday aspects of our lives. What we’re unlikely to see (in the near term, at least) are out of control malevolent artificial agents blowing us out of airlocks at inopportune times (think HAL 9000 from 2001: A Space Odyssey).

Deep learning continues to push the field forward in many domains and on many core machine learning problems. Here are just a few of the benchmark records deep learning has achieved in the last few years:

  • Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
  • Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
  • Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
  • Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
  • Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
  • English-to-French translation (Sutskever et al., Google, NIPS 2014)
  • Audio onset detection (Marchi et al., ICASSP 2014)
  • Social signal classification (Brueckner & Schulter, ICASSP 2014)
  • Arabic handwriting recognition (Bluche et al., DAS 2014)
  • TIMIT phoneme recognition (Graves et al., ICASSP 2013)
  • Optical character recognition (Breuel et al., ICDAR 2013)
  • Image caption generation (Vinyals et al., Google, 2014)
  • Video-to-textual description (Donahue et al., 2014)
  • Syntactic parsing for natural language processing (Vinyals et al., Google, 2014)
  • Photo-real talking heads (Soong and Wang, Microsoft, 2014)

Based on these accomplishments, we can easily project that deep learning will impact many applications over the next decade.

We probably won’t realize all of the major commercial applications until they’re right in front of our faces. Understanding the advances in deep network architecture is important in order to understand application ideas going forward.

Advances in network architecture

As research pressed the state of the art forward from multilayer feed-forward networks toward newer architectures like CNNs and Recurrent Neural Networks, the discipline saw changes in how layers were set up, how neurons were constructed, and how we connected layers. Network architectures evolved to take advantage of specific types of input data.

Advances in layer types

Layers became more varied with the different types of architectures. Deep Belief Networks (DBNs) demonstrated success with using Restricted Boltzmann Machines (RBMs) as layers in pretraining to build features. CNNs used new and different types of activation functions in layers and changed how we connected layers (from fully connected to locally connected patches). Recurrent Neural Networks explored the use of connections that better modeled the time domain in time-series data.

Advances in neuron types

Work on Recurrent Neural Networks, particularly around LSTM networks, produced advances in the types of neurons (or units) applied, introducing units such as the LSTM Memory Cell and Gated Recurrent Units (GRUs).

Hybrid architectures

Continuing the theme of matching input data to architecture type, we have seen hybrid architectures emerge for data that has both a temporal dimension and image content. For instance, classifying objects in video has been successfully demonstrated by combining layers from both CNNs and Recurrent Neural Networks into a single hybrid network. Hybrid neural network architectures can allow us to take advantage of the best of both worlds in some cases.

From feature engineering to automated feature learning

Although deep networks might have innovated with new units and layers for their internals, they are still fundamentally topped with a discriminative classifier that takes the constructed features as input. Automating feature extraction is a common theme among the various architectures. Each architecture does feature construction differently and is specialized such that it is better at certain types of input than others. Yann LeCun hit on this theme when he described deep learning as “machines that learn to represent the world.”

Geoffrey Hinton talks about this theme in DBNs when he explains how RBMs are used to decompose the data into higher-order features.5

Categorizing DBNs

For the purposes of this lesson, we place DBNs (and autoencoders) in the UPN group of deep networks.

Staying with the image classification theme, we can use the example of face detection. Raw image data of faces as input presents challenges with how the face is oriented, the lighting of the photo, and the position of the key features of the face. The key features we’d normally associate with a face are things like the edge of the face, the edges of specific features like the eyes and nose, and then subtle features we don’t consistently see, like dimples.

Feature engineering

Handcrafting features has been a hallmark of machine learning for a long time. Practitioners who win competitions in machine learning often study the dataset thoroughly and use many arcane tricks to make the learning process as simple as possible for their learning algorithm. These datasets are often columnar/tabular text data and we can apply domain knowledge to specific columns so feature creation is more direct.

If the input data is the A matrix in the equation Ax = b, we can see how we had to hand-code the values from the data into those specific columns of A. These handcrafted features tend to produce highly accurate models but take a lot of time and experience to produce. From a knowledge representation perspective, it’s like reading a poorly written book versus one that is well written and easy to read. The former takes us a lot longer to read, and we need to spend more energy to get the same out of it as the latter.

Image classification is an interesting example because handcrafting image features is more difficult than creating features for tabular data. The information in images is not constrained to stay in the same column and can be influenced by lighting, angle, and other issues. Feature extraction and creation for images needed a new approach, which in some part drove the evolution of CNNs.

Feature learning

Coming back to our face-detection example, a nose can be located in any set of pixels in an image as opposed to our bank balance always being located in a specific column in tabular data. With CNNs, we train the network to understand the edge of the nose and then the general shape of the nose from lower-level “nose-edge” features. The first layers in the network might pick up those nose-edge features and then pass them on to later layers in the network as larger feature maps.

These more granular patches of feature maps eventually are combined into a “face” feature in the later layers of the CNN. This allows a CNN to take on a task that has been attempted many times before (“Is this a face?”) yet pose the question in a simpler form that can be answered more accurately with less effort.

Automated Feature Learning with Complex Data

Taking complex raw data and creating higher-order features automatically in order to make a simpler classification (or regression) output is a hallmark of deep learning.

As you progress through this lesson, you’ll get a better sense of how to match input data types to deep network architectures and how to set up these architectures to best model the underlying dataset.

Generative modeling

Generative modeling is not a new concept, but the level to which deep networks have taken it has begun to rival human creativity. From generating art to generating music to even writing beer reviews, we see deep learning applied in creative ways every day. Recent variants of generative modeling to note include the following:

  • Inceptionism
  • Modeling artistic style
  • Generative Adversarial Networks
  • Recurrent Neural Networks

Let’s quickly review each of these.

Inceptionism

Inceptionism is a technique in which a trained convolutional network is taken with its layers in reverse order and given an input image coupled with a prior constraint. The images are modified iteratively to enhance the output in a manner that could be described as “hallucinative.” In examples for which the input involves images of the sky, we might see fish faces appear in the clouds of the output image. This line of research from Google has shown that discriminative neural network models contain considerable information that can be used to generate images.

Modeling artistic style

Variants of convolutional networks have been shown to learn the style of specific painters and then render arbitrary photographs in that style. Figure 2-2 shows the amazing results. Imagine having your family photo painted by Vincent van Gogh. (By the time this lesson is published, this will probably be a Snapchat filter, so you won’t have to wait that long.)

Figure 2-2. Stylized images by Gatys et al., 2015

In 2015, Gatys et al. published a paper titled “A Neural Algorithm of Artistic Style”6 in which they separate the style and the content of a painting. The CNN extracts the artist’s style into the network’s parameters, which can later be applied to arbitrary images to be rendered in the same style.

GANs

The generative visual output of a GAN7 can best be described as synthesizing novel images by modeling the distributions of input data seen by the network. We cover GANs in more depth in Chapter 3.

Recurrent Neural Networks

Recurrent Neural Networks have been shown to model sequences of characters and generate new sequences that are lucidly coherent.

Another interesting application of Recurrent Neural Networks is the work by Lipton and Elkan in which the network models proper nouns like “Coors Light” and other aspects of beer jargon. The generated beer reviews can be guided with hints (e.g., “give me a 3-star review of a German lager”) and are impressive. Here’s a sample beer review generated by the program:

On tap at the brewpub. A nice dark red color with a nice head that left a lot of lace on the glass. Aroma is of raspberries and chocolate. Not much depth to speak of despite consisting of raspberries. The bourbon is pretty subtle as well. I really don’t know that find a flavor this beer tastes like. I would prefer a little more carbonization to come through. It’s pretty drinkable, but I wouldn’t mind if this beer was available.8

The Tao of deep learning

There is a lot of marketing noise and hype in the realm of deep learning today, some of it justifiably so. However, deep learning is still trying to answer the same fundamental machine learning questions like: “Is this image a face?” The difference is that deep learning has taken the previous generation’s neural network techniques and added advanced automated feature construction to make computationally difficult questions on complex data easier to answer.

When you use deep learning as a practitioner, the best way to take advantage of this power is to match the input data to the appropriate deep network architecture. If you do this, you can apply deep learning successfully in new and interesting ways. If you don’t, you won’t add any new modeling power beyond basic techniques like logistic regression. The remainder of this lesson is dedicated to giving you, as a practitioner, the skills and context necessary to make these decisions and use deep learning well.

Organization of This Chapter

In this chapter, we dig further into specific architectures for deep networks. We’ll differentiate the architectures and break down how their components evolved differently, providing color on how this better extracts features from certain types of data. We close the chapter with some discussion of the practicality of deep learning and dispel some misconceptions surrounding the domain today. With that, let’s continue our discussion of the architecture components relevant to deep networks.

Common Architectural Principles of Deep Networks

Before we get into the specific architectures of the major deep networks, let’s extend our understanding of the core components. First, we’ll reexamine the following core components and extend their coverage for the purposes of understanding deep networks:

  • Parameters
  • Layers
  • Activation functions
  • Loss functions
  • Optimization methods
  • Hyperparameters

Next, we’ll take these concepts and build on them to better understand the building block networks of deep networks, such as the following:

  • RBMs
  • Autoencoders

We’ll then continue to build on these ideas by reviewing these specific deep network architectures:

  • UPNs
  • CNNs
  • Recurrent neural networks
  • Recursive neural networks

As we work our way through this chapter, we’ll also drop references to how DL4J implements certain aspects of deep networks. For now, let’s continue our review of parameters to better understand how they are extended for deep networks.

Parameters

Parameters relate to the x parameter vector in the equation Ax = b in basic machine learning. Parameters in neural networks relate directly to the weights on the connections in the network. We can see the parameter vector represented by the x column vector. We take the dot product of the matrix A and the parameter vector x to get our current output column vector b. The closer our outcome vector b is to the actual values in the training data, the better our model is. We use methods of optimization such as gradient descent to find good values for the parameter vector to minimize loss across our training dataset.
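To make this concrete, the following sketch fits the parameter vector x by gradient descent so that Ax approaches b. It is a plain NumPy illustration rather than DL4J code, and the matrix values are invented purely for this example:

    import numpy as np

    # Made-up training data: A holds input rows, b_true holds target outputs.
    A = np.array([[1.0, 2.0],
                  [3.0, 1.0],
                  [0.0, 4.0]])
    b_true = np.array([5.0, 5.0, 8.0])

    x = np.zeros(2)       # the parameter vector we are optimizing
    learning_rate = 0.01

    for step in range(500):
        b_pred = A @ x                         # current model output: Ax
        error = b_pred - b_true                # gap between output and targets
        gradient = A.T @ error / len(b_true)   # gradient of mean squared loss
        x -= learning_rate * gradient          # gradient descent step

    print(x)  # approaches the least-squares solution, here [1.0, 2.0]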

In deep networks, we still have a parameter vector representing the connections in the network model we’re trying to optimize. The biggest change in deep networks with respect to parameters is how the layers are connected in the different architectures. In DBNs, we see two parallel sets of feed-forward connections with two separate networks. One network’s layers are composed of RBMs (subnetworks in their own right, which we’ll review later in the chapter) used to extract features for the other network. The other network in a DBN is a regular feed-forward multilayer neural network, which uses the features extracted from the RBM-layer network to initialize its weights. This is just one example of many that we’ll see over the course of this chapter of how parameters/weights are specialized in different deep network architectures.

Parameters and NDArrays

In terms of working with the core linear algebra of deep networks, DL4J relies on the ND4J library to represent these linear algebra primitives. NDArrays and linear algebra are key to working with neural networks in DL4J.

Layers

Input, hidden, and output layers define feed-forward neural networks. In Chapter 1, we further expanded this architecture with more types of layers and discussed how they relate to specific architectures of deep networks. Layers can also be represented by subnetworks in certain architectures. In the previous section, we used the example of DBNs having layers composed of RBMs.

Layers are a fundamental architectural unit in deep networks. In DL4J we customize a layer by changing the type of activation function it uses (or subnetwork type in the case of RBMs). We’ll also look at how you can use combinations of layers to achieve a goal (e.g., classification or regression). Finally, we’ll also explore how each type of layer requires different hyperparameters (specific to the architecture) to get our network to learn initially. Further hyperparameter tuning can then be beneficial through reducing overfitting.

Activation Functions

In this chapter, we begin to illustrate how activation functions are used in specific architectures to drive feature extraction. The higher-order features learned from the data in deep networks are a nonlinear transform applied to the output of the previous layer. This allows the network to learn patterns in the data within a constrained space.

Activation functions for general architecture

Depending on the activation function you pick, you will find that some objective functions are more appropriate for different kinds of data (e.g., dense versus sparse). We group these design decisions for network architecture into two main areas across all architectures:

  • Hidden layers
  • Output layers

Hidden layers are concerned with extracting progressively higher-order features from the raw data. Depending on the architecture we’re working with, we tend to use certain subsets of layer activation functions. As we work through this chapter, we’ll illustrate these patterns more across DBNs, CNNs, and Recurrent Neural Networks. In Chapter 3, we take a deeper look at the impacts of different activation functions on different network architectures in the context of tuning deep networks.

A Note About Input Layers

Typically for the input layer, we want to pass on the raw input vector features so that in practice we don’t express an activation function for the input layer.

Hidden layer activation functions

Commonly used functions include:

  • Sigmoid
  • Tanh
  • Hard tanh
  • Rectified linear unit (ReLU) (and its variants)

A more continuous distribution of input data is generally best modeled with a ReLU activation function. Optionally, we’d suggest using the tanh activation function (if the network isn’t very deep) in the event that the ReLU did not achieve good results (with the caveat that there could be other hyperparameter-related issues with the network).
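The sketch below gives minimal NumPy implementations of these hidden-layer activation functions (plus the leaky ReLU variant discussed shortly); it is illustrative only, not DL4J code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        return np.tanh(z)

    def hard_tanh(z):
        # Linear inside [-1, 1], clipped flat outside that range
        return np.clip(z, -1.0, 1.0)

    def relu(z):
        return np.maximum(0.0, z)

    def leaky_relu(z, alpha=0.01):
        # ReLU variant: a small nonzero slope for negative inputs
        return np.where(z > 0, z, alpha * z)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(z))  # [0.  0.  0.  0.5 2. ]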

Sigmoid Activation Functions in Practice

In recent years, we’ve seen the sigmoid activation function fall out of favor for hidden layers in practice and research.

As we progress further in this lesson, we’ll see how these activation functions tend to be arranged in the different architectures.

The Evolution of Activation Functions in Practice

We’re also seeing an entire family of ReLUs emerge in deep learning research, such as the “leaky ReLU.”

Output layer for regression

This design decision is motivated by what type of answer we expect our model to output. If we want to output a single real-valued number from our model, we’ll want to use a linear activation function.

Output layer for binary classification

In this case, we’d use a sigmoid output layer with a single neuron to give us a real value in the range of 0.0 to 1.0 (excluding those values) for the single class. This real-valued output is typically interpreted as a probability distribution.

Output layer for multiclass classification

If we have a multiclass modeling problem yet we only care about the best score across these classes, we’d use a softmax output layer with an argmax() function to get the highest score of all the classes. The softmax output layer gives us a probability distribution over all the classes.

Getting Multiple Classifications

If we want to get multiple classifications per output (e.g., person + car), we do not want softmax as an output layer. Instead, we’d use the sigmoid output layer with n number of neurons, giving us a probability distribution (0.0 to 1.0) for every class independently.
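The following NumPy sketch contrasts the two output-layer strategies just described; the logits are made-up raw scores for three classes:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))   # shift by the max for numerical stability
        return e / e.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    logits = np.array([2.0, 1.0, 0.1])

    # Multiclass, single label: probabilities sum to 1.0; argmax picks the winner
    probs = softmax(logits)
    print(probs, probs.sum(), np.argmax(probs))

    # Multiple classifications (e.g., person + car): one independent
    # probability per class; the values need not sum to 1.0
    print(sigmoid(logits))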

Loss Functions

In Chapter 1 we introduced loss functions and their role in machine learning. Loss functions quantify how well the predicted output (or label) agrees with the ground truth output. We use loss functions to determine the penalty for an incorrect classification of an input vector. So far, we’ve introduced the following loss functions:

  • Squared loss
  • Logistic loss
  • Hinge loss
  • Negative log likelihood

Previously we described loss functions as falling into one of three camps:

  • Regression
  • Classification
  • Reconstruction

The third, reconstruction, is involved in unsupervised feature extraction and is an important reason why deep learning networks have achieved their record-breaking accuracy. In certain architectures of deep networks, reconstruction loss functions help the network extract features more effectively when paired with the appropriate activation function. An example of this pairing would be using multiclass cross-entropy as the loss function in a layer with a softmax activation function for classification output. We cover a specialized loss function in the next section.
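Before moving on, here is a hedged NumPy sketch of that softmax/cross-entropy pairing; the scores are invented for illustration:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def cross_entropy(probs, label_index):
        # Negative log likelihood of the true class
        return -np.log(probs[label_index])

    logits = np.array([2.0, 1.0, 0.1])
    probs = softmax(logits)

    print(cross_entropy(probs, 0))  # small penalty: class 0 is likely
    print(cross_entropy(probs, 2))  # larger penalty: class 2 is unlikely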

Reconstruction cross-entropy

With the reconstruction cross-entropy loss function, we first apply Gaussian noise (a kind of statistical white noise) to the input, and then the loss function punishes the network for any result that is less similar to the original input data. This feedback drives the network to learn different features in an attempt to reconstruct the input more effectively and minimize error. In deep learning, reconstruction cross-entropy loss is used for feature engineering in the pretrain phase that involves RBMs.

Optimization Algorithms

Training a model in machine learning involves finding the best set of values for the parameter vector of the model. We can think of machine learning as an optimization problem in which we minimize the loss function with respect to the parameters of our prediction function (based on our model).

Defining “Best” in Terms of the Loss Function

In optimization algorithms, we define “best set of values” for the parameter vector as the values with the lowest loss function value.

In this section, we take a look at more advanced methods of optimization and how we can use them to train deep networks. In this lesson, we divide optimization algorithms into two camps:

  • First-order
  • Second-order

First-order optimization algorithms calculate the Jacobian matrix.

The Jacobian

The Jacobian is a matrix of partial derivatives of loss function values with respect to each parameter.

The Jacobian has one partial derivative per parameter (to calculate partial derivatives, all other variables are momentarily treated as constants). The algorithm then takes one step in the direction specified by the Jacobian.

Second-order algorithms calculate the derivative of the Jacobian (i.e., the derivative of a matrix of derivatives) by approximating the Hessian. Second-order methods take into account interdependencies between parameters when choosing how much to modify each parameter.

Second-Order Methods

Second-order methods can take “better” steps; however, each step will take longer to calculate.

Practical Use of Optimization Algorithms

We provide much of the detail on optimization algorithms so that you are aware of the mechanics involved, for reference.

Other Optimization Algorithms

There are other variations of optimization algorithms (such as “meta heuristics”) that we won’t cover in this lesson. They include the following:

  • Genetic algorithms
  • Particle swarm optimization
  • Ant colony optimization
  • Simulated annealing

First-order methods

The Jacobian, as mentioned, is a matrix of partial derivatives of the loss function with respect to the parameters in the network. In practice, we calculate it at a specific point—the current values of the parameters.

If we think about taking one step at a time to reach an objective, first-order methods calculate a gradient (Jacobian) at each step to determine which direction to go in next. This means that at each iteration, or step, we are trying to find the next best possible direction to go, as defined by our objective function. This is why we consider optimization algorithms to be a “search.” They are finding a path toward minimal error.

Gradient descent is a member of this path-finding class of algorithms. Variations of gradient descent exist, but at its core, it finds the next step in the right direction with respect to an objective at each iteration. Those steps move us toward a global minimum error or maximum likelihood.

Stochastic gradient descent (SGD) is machine learning’s workhorse optimization algorithm. SGD trains several orders of magnitude faster than methods such as batch gradient descent, with no loss of model accuracy.

Why Is SGD Considered “Stochastic”?

This is due to how we calculate the gradient for a single input training example (or mini-batch of training examples). The computed gradient is a “noisy” approximation of the true gradient yet allows SGD to converge faster.

The strengths of SGD are easy implementation and the quick processing of large datasets. You can adjust SGD by adapting the learning rate (e.g., with methods such as Adagrad, discussed shortly) or using second-order information (i.e., the Hessian), as we’ll see next. SGD is also a popular algorithm for training neural networks due to its robustness in the face of noisy updates. That is, it helps you build models that generalize well.
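The sketch below shows the mechanics of mini-batch SGD on a linear model. It is a NumPy illustration, not DL4J code, and the synthetic dataset and hyperparameter values are made up:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic linear data: b = A @ x_true plus a little noise
    x_true = np.array([1.0, 2.0])
    A = rng.normal(size=(1000, 2))
    b = A @ x_true + 0.01 * rng.normal(size=1000)

    x = np.zeros(2)
    learning_rate = 0.1
    batch_size = 32

    for epoch in range(10):
        indices = rng.permutation(len(b))     # shuffle examples each epoch
        for start in range(0, len(b), batch_size):
            idx = indices[start:start + batch_size]
            A_batch, b_batch = A[idx], b[idx]
            # "Noisy" gradient estimated from the mini-batch alone
            grad = A_batch.T @ (A_batch @ x - b_batch) / len(idx)
            x -= learning_rate * grad

    print(x)  # close to x_true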

Other Factors in Learning Rate Adjustment

It’s relevant to note that other techniques such as momentum and RMSProp can affect learning rates.

Second-order methods

All second-order methods calculate or approximate the Hessian. As described earlier, we can think of the Hessian as the derivative of the Jacobian. That is, it is a matrix of second-order partial derivatives, analogous to “tracking acceleration rather than speed.” The Hessian’s job is to describe the curvature of each point of the Jacobian. Second-order methods include:

  • Limited-memory BFGS (L-BFGS)9
  • Conjugate gradient10
  • Hessian-free11

Think of these optimization algorithms as black-box search algorithms that determine the best way to minimize error, given an objective and a defined gradient relative to each layer.

Making Trade-offs in Optimization

A major difference in first- and second-order methods is that second-order methods converge in fewer steps yet take more computation per step.

L-BFGS

L-BFGS is an optimization algorithm and a so-called quasi-Newton method. As its name indicates, it’s a variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, and it limits how much gradient is stored in memory. By this, we mean the algorithm does not compute the full Hessian matrix, which is more computationally expensive.

L-BFGS approximates the inverse Hessian matrix to direct weight adjustments toward more promising areas of parameter space. Whereas BFGS stores the full n × n inverse Hessian matrix, L-BFGS stores only a few vectors that represent a local approximation of it. L-BFGS performs faster because it uses approximated second-order information. In practice, L-BFGS and conjugate gradient can be faster and more stable than SGD methods.

L-BFGS in Practice

Although L-BFGS has some interesting properties, it is not commonly used in practice for deep networks.

Conjugate gradient

Conjugate gradient guides the direction of the line search process based on conjugacy information. Conjugate gradient methods focus on minimizing the conjugate L2 norm. Conjugate gradient is very similar to gradient descent in that it performs line search. The major difference is that conjugate gradient requires each successive step in the line search process to be conjugate to one another with respect to direction.

Hessian-free

Hessian-free optimization is related to Newton’s method, but it better minimizes the quadratic function we get from the local approximation of the loss. It is a powerful optimization method adapted to neural networks by James Martens in 2010. We find the minimum of the quadratic function with an iterative method called conjugate gradient.

Hyperparameters

Here we define a hyperparameter as any configuration setting that is free to be chosen by the user that might affect performance.

Hyperparameters fall into several categories:

  • Layer size
  • Magnitude (momentum, learning rate)
  • Regularization (dropout, drop connect, L1, L2)
  • Activations (and activation function families)
  • Weight initialization strategy
  • Loss functions
  • Settings for epochs during training (mini-batch size)
  • Normalization scheme for input data (vectorization)

In this section, we look at some new hyperparameters relevant to deep learning training.

A Few Cautionary Notes About Hyperparameters

Some hyperparameters apply only some of the time. Moreover, changing a specific hyperparameter might affect the best settings for other hyperparameters. We’d also like to point out that some hyperparameters are incompatible with one another (e.g., Adagrad + momentum).

Layer size

Layer size is defined by the number of neurons in a given layer. Input and output layers are relatively easy to figure out because they correspond directly to how our modeling problem handles input and output. For the input layer, this will match up to the number of features in the input vector. For the output layer, this will either be a single output neuron or a number of neurons matching the number of classes we are trying to predict.

Deciding on neuron counts for each hidden layer is where hyperparameter tuning becomes a challenge. We can use an arbitrary number of neurons to define a layer, and there are no rules about how big or small this number can be. However, how complex a problem we can model is directly correlated to how many neurons are in the hidden layers of our networks. This might push you to begin with a large number of neurons from the start, but these neurons come with a cost.

Depending on the deep network architecture, the connection schema between layers can vary. However, the weights on the connections are the parameters we must train. As we include more parameters in our model, we increase the amount of effort needed to train the network. Large parameter counts can lead to long training times and models that struggle to find convergence.

Large Parameter Counts and Overfitting

There are also cases in which a larger model will sometimes converge more easily because it will simply “memorize” the training data.

Magnitude hyperparameters

Hyperparameters in the magnitude group involve the gradient, step size, and momentum.

Learning rate

The learning rate in machine learning is how fast we change the parameter vector as we move through search space. With a higher learning rate, we can move toward our goal faster (least amount of error for the function being evaluated), but we might also take a step so large that we shoot right past the best answer to the problem.

High Learning Rates and Stability

Another side effect of learning rates that are large is that we run the risk of having unstable training that does not converge over time.

If we make our learning rate too small, it might take a lot longer than we’d like for our training process to complete. A low learning rate can make our learning algorithm inefficient. Learning rates are tricky because they end up being specific to the dataset and even to other hyperparameters. This creates a lot of overhead for finding the right setting for hyperparameters.

We also can schedule learning rates to decrease over time according to some rule.
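One common rule is step decay, sketched below with arbitrary constants: the rate is cut in half every fixed number of epochs.

    def step_decay(base_rate, epoch, drop=0.5, epochs_per_drop=10):
        # Halve the learning rate every `epochs_per_drop` epochs
        return base_rate * (drop ** (epoch // epochs_per_drop))

    for epoch in [0, 9, 10, 25, 40]:
        print(epoch, step_decay(0.1, epoch))
    # 0 -> 0.1, 9 -> 0.1, 10 -> 0.05, 25 -> 0.025, 40 -> 0.00625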

The Importance of Learning Rate as a Hyperparameter

Learning rate is considered one of the key hyperparameters in neural networks.

Nesterov’s momentum

The “vanilla” version of SGD uses the gradient directly, and this can be problematic because the gradient can be nearly zero for any parameter. This causes SGD to take tiny steps in some cases, and steps that are too big in situations where the gradient is large. To alleviate these issues, we can use techniques such as the following:

  • Nesterov’s momentum
  • RMSProp
  • Adam
  • AdaDelta

DL4J and Updaters

Nesterov’s momentum, RMSProp, Adam, and AdaDelta are known as “updaters” in the terminology of DL4J. Most of the terms used in this lesson are universal for most of all deep learning literature; we just wanted to note this variation for DL4J specifically.

We can speed up our training by increasing momentum, but we might lower the chance that the model will reach minimal error by overshooting the optimal parameter values. Momentum is a factor between 0.0 and 1.0 that is applied to the change rate of the weights over time.  Typically, we see the value for momentum between 0.9 and 0.99.
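Here is a sketch of the classical momentum update alongside the Nesterov look-ahead variant, applied to a toy quadratic objective; the hyperparameter values are illustrative only:

    import numpy as np

    def momentum_step(x, velocity, grad_fn, learning_rate=0.01, momentum=0.9):
        # Classical momentum: velocity accumulates a decaying sum of gradients
        velocity = momentum * velocity - learning_rate * grad_fn(x)
        return x + velocity, velocity

    def nesterov_step(x, velocity, grad_fn, learning_rate=0.01, momentum=0.9):
        # Nesterov: evaluate the gradient at the "look-ahead" position instead
        lookahead = x + momentum * velocity
        velocity = momentum * velocity - learning_rate * grad_fn(lookahead)
        return x + velocity, velocity

    grad_fn = lambda x: x   # gradient of f(x) = 0.5 * ||x||^2
    x, v = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(200):
        x, v = nesterov_step(x, v, grad_fn)
    print(x)  # approaches the minimum at the origin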

AdaGrad

AdaGrad12 is one technique that has been developed to help with finding the “right” learning rate automatically. AdaGrad is named in reference to how it “adaptively” uses subgradient methods to dynamically control the learning rate of an optimization algorithm. AdaGrad is monotonically decreasing and never increases the learning rate above whatever the base learning rate was set at initially.

AdaGrad scales the learning rate for each parameter by the square root of the sum of squares of that parameter’s gradient history. AdaGrad speeds our training in the beginning and slows it appropriately toward convergence, allowing for a smoother training process.
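A minimal sketch of the AdaGrad update rule follows; the constants are illustrative, and this is a NumPy demonstration rather than DL4J’s updater implementation:

    import numpy as np

    def adagrad_step(x, grad, cache, base_rate=0.1, eps=1e-8):
        cache += grad ** 2   # accumulate squared gradients per parameter
        # The effective learning rate shrinks as the gradient history grows
        x -= base_rate * grad / (np.sqrt(cache) + eps)
        return x, cache

    x = np.array([5.0, -3.0])
    cache = np.zeros_like(x)
    for _ in range(200):
        grad = x   # gradient of the toy objective f(x) = 0.5 * ||x||^2
        x, cache = adagrad_step(x, grad, cache)
    print(x)  # the parameters shrink toward the minimum as steps accumulate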

RMSProp

RMSprop is a very effective, but currently unpublished adaptive learning rate method. Amusingly, everyone who uses this method in their work currently cites slide 29 of Lecture 6 of Geoff Hinton’s Coursera class.

AdaDelta

AdaDelta13 is a variant of AdaGrad that keeps only the most recent history rather than accumulating it like AdaGrad does.

ADAM

ADAM (a more recently developed updating technique from the University of Toronto) derives learning rates from estimates of first and second moments of the gradients.

Regularization

Let’s dig deeper into the idea of regularization that we touched on in Chapter 1. Regularization is a measure taken against overfitting. Overfitting occurs when a model describes the training set but cannot generalize well over new inputs. Overfitted models have no predictive capacity for data that they haven’t seen. Geoffrey Hinton described the best way to build a neural network model:

Cause it to overfit, and then regularize it to death.

Regularization for hyperparameters helps modify the gradient so that it doesn’t step in directions that lead it to overfit. Regularization includes the following:

  • Dropout
  • DropConnect
  • L1 penalty
  • L2 penalty

Dropout and DropConnect mute parts of the input to each layer, such that the neural network learns other portions. Zeroing-out parts of the data causes a neural network to learn more general representations. Regularization works by adding an extra term to the normal gradient computed.

Dropout

Dropout14 is a mechanism used to improve the training of neural networks by randomly omitting hidden units during training; it also speeds training. Dropout works by randomly dropping a neuron so that it will not contribute to the forward pass and backpropagation.

Dropout Related to Model Averaging

We also can relate dropout to the concept of averaging the output of multiple models. If we use a dropout coefficient of 0.5, we have the mean of the model. A random dropout of features is a sampling from 2^N possible architectures, where N is the number of units that can be dropped.
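The sketch below shows the “inverted dropout” formulation commonly used in practice: surviving activations are rescaled during training so that nothing needs to change at inference time. The activation values are made up:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(activations, keep_prob=0.5, training=True):
        if not training:
            return activations   # no dropout at inference time
        # Randomly mute units; rescale survivors to keep the expected value
        mask = rng.random(activations.shape) < keep_prob
        return activations * mask / keep_prob

    hidden = np.array([0.2, 1.5, 0.7, 2.1, 0.9, 1.1])
    print(dropout_forward(hidden))                   # roughly half muted
    print(dropout_forward(hidden, training=False))   # unchanged at test time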

DropConnect

DropConnect15 does the same thing as Dropout, but instead of choosing a hidden unit, it mutes the connection between two neurons.

L1

The penalty methods L1 and L2, in contrast, are a way of preventing the neural network parameter space from getting too big in one direction. They make large weights smaller.

L1 regularization is considered computationally inefficient in the nonsparse case, has sparse outputs, and includes built-in feature selection. L1 regularization penalizes the sum of the absolute values of the weights rather than the sum of their squares. This function drives many weights to zero while allowing a few to grow large, making it easier to interpret the weights.

L2

In contrast, L2 regularization is computationally efficient due to it having analytical solutions and nonsparse outputs, but it does not do feature selection automatically for us. The “L2” regularization function, a common and simple hyperparameter, adds a term to the objective function that decreases the squared weights. You multiply half the sum of the squared weights by a coefficient called the weight-cost. L2 improves generalization, smooths the output of the model as input changes, and helps the network ignore weights it does not use.
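Here is a short sketch of how the two penalty terms are computed and how the L2 term contributes to the gradient; the coefficient and weights are invented for illustration:

    import numpy as np

    def l1_penalty(weights, lam=0.01):
        # Sum of the absolute values of the weights
        return lam * np.sum(np.abs(weights))

    def l2_penalty(weights, lam=0.01):
        # Half the sum of squared weights times the weight-cost coefficient
        return lam * 0.5 * np.sum(weights ** 2)

    def l2_gradient(weights, lam=0.01):
        # Term added to the loss gradient; it shrinks large weights
        return lam * weights

    weights = np.array([0.5, -2.0, 0.0, 3.0])
    data_loss = 1.25   # made-up loss from the data term
    print(data_loss + l2_penalty(weights))
    print(l2_gradient(weights))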

Mini-batching

With mini-batching,16 we send more than one input vector (a group or batch of vectors) to be trained in the learning system. This allows us to use hardware and resources more efficiently at the computer-architecture level. This method also allows us to compute certain linear algebra operations (specifically matrix-to-matrix multiplications) in a vectorized fashion. In this scenario we also have the option of sending the vectorized computations to GPUs if they are present.
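The NumPy sketch below illustrates the vectorization point: pushing a whole mini-batch through a layer as one matrix-to-matrix multiplication yields the same result as looping over individual vectors, but maps onto far more efficient hardware operations (and onto GPUs when available):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3))       # weights for a layer: 4 inputs -> 3 units
    batch = rng.normal(size=(32, 4))  # a mini-batch of 32 input vectors

    # One vector at a time: 32 separate matrix-vector products
    one_at_a_time = np.stack([v @ W for v in batch])

    # Mini-batched: a single matrix-matrix multiplication for the whole batch
    vectorized = batch @ W            # shape (32, 3)

    print(np.allclose(one_at_a_time, vectorized))  # True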

Summary

In Chapter 1, we learned about some of the basic regularization tools for feed-forward multilayer neural networks. In this chapter, we expanded this definition to some newer techniques and options for hyperparameters to find better parameter vectors. Let’s now put some of these ideas together to construct building blocks for deep networks.

Building Blocks of Deep Networks

Building deep networks goes beyond basic feed-forward multilayer neural networks. In some cases, deep networks combine smaller networks as building blocks into larger networks; in other cases, they use a specialized set of layers. Here are the specific building blocks we want to highlight:

  • Feed-forward multilayer neural networks
  • RBMs
  • Autoencoders

In Chapter 1, we introduced the canonical feed-forward networks. Inspired by networks of biological neurons, feed-forward networks are the simplest artificial neural networks. They are composed of an input layer, one or many hidden layers, and an output layer. In this section, we introduce networks that are considered building blocks of larger deep networks:

  • RBMs
  • Autoencoders

Both RBMs and autoencoders are characterized by an extra layer-wise step for training. They are often used for the pretraining phase in other larger deep networks.

Unsupervised Layer-Wise Pretraining

Unsupervised layer-wise pretraining17 can help in some training circumstances. Over time, better optimization methods, activation functions, and weight initialization methods have lessened the importance of pretraining-based deep networks. A case for which pretraining becomes interesting is when we have a lot of unlabeled data yet only a relatively smaller set of labeled training data. Pretraining does, however, add extra overhead regarding tuning and extra training time.

Layer-wise pretraining works by performing unsupervised pretraining of the first layer (e.g., RBMs) based on the input data. This gives us the first layer of weights for our main neural network (e.g., feed-forward multilayer perceptron). We perform this process for each layer progressively in the network, using the output of previous layers based on the training input as the input to the successive layers. This pretraining process allows us to initialize the main neural network’s parameters with good initial values.

RBMs model probability distributions and are great at feature extraction. They are feed-forward networks through which data passes in one direction, and they have two sets of biases rather than the single set of biases found in traditional backpropagation feed-forward networks.

Autoencoders are a variant of feed-forward neural networks that have an extra bias for calculating the error of reconstructing the original input. After training, autoencoders are then used as normal feed-forward neural networks for activations. This is an unsupervised form of feature extraction because the neural network uses only the original input for learning weights rather than labeled examples. Deep networks can use either RBMs or autoencoders as building blocks for larger networks (it is, however, rare that a single network would use both). In the following sections, we take a closer look at both networks.

RBMs

RBMs are used in deep learning for the following:

  • Feature extraction
  • Dimensionality reduction

The “restricted” part of the name “Restricted Boltzmann Machines” means that connections between nodes of the same layer are prohibited (e.g., there are no visible-visible or hidden-hidden connections along which signal passes). Geoff Hinton, the deep learning pioneer who popularized RBM use almost a decade ago, describes the more general Boltzmann machine as follows:

A network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off.

RBMs are also a type of autoencoder, which we’ll talk about in a following section. RBMs are used for pretraining layers in larger networks such as Deep Belief Networks.

Network layout

There are five main parts of a basic RBM:

  • Visible units
  • Hidden units
  • Weights
  • Visible bias units
  • Hidden bias units

A standard RBM has a visible layer and a hidden layer, as shown in Figure 2-3. We can also see a graph of weights (connections) between the hidden and visible units in the figure. Think of these weights in the same way you think of weights in the classical neural network sense.

Figure 2-3. RBM network

With RBMs, every visible unit is connected to every hidden unit, yet no units from the same layer are connected. Each layer of an RBM can be imagined as a row of nodes. The nodes of the visible and hidden layers are connected by connections with associated weights.

Visible and hidden layers

In an RBM, every single node of the input (visible) layer is connected by weights to every single node of the hidden layer, but no two nodes of the same layer are connected. The second layer is known as the “hidden” layer. Hidden units are feature detectors, learning features from the input data. Nodes in each layer are biologically inspired, as with feed-forward multilayer neural networks. Units (nodes) in the visible layer are “observable” in that they take training vectors as input. Each layer has a bias unit with its state always set to on.

Each node performs computation based on the input to the node and outputs a result based on a stochastic decision about whether or not to transmit data through an activation. Just as with an artificial neuron, the activation computation is based on the weights on the connections and the input values. The initial weights are randomly generated.

Connections and weights

All connections are visible-hidden; none are visible-visible or hidden-hidden. The edges represent connections along which signals are passed. Loosely speaking, those circles, or nodes, act like human neurons. They are decision units. They make decisions about whether to be on or off through acts of computation. “On” means that they pass a signal further through the net; “off” means that they don’t.

Usually, being “on” means the data passing through the node is valuable; it contains information that will help the network make a decision. Being “off” means the network thinks that particular input is irrelevant noise. A network comes to know which features/signals correlate with which labels (which code contains which messages) by being trained. With training, networks learn to make accurate classifications of the input they receive.

Biases

There is a set of bias weights (“parameters”) connecting the bias unit for each layer to every unit in the layer. Bias nodes help the network better triage and model cases in which an input node is always on or always off.

Training

The technique known as pretraining with RBMs means teaching the network to reconstruct the original data from a limited sample of that data. That is, given a chin, a trained network could approximate (or “reconstruct”) a face. RBMs learn to reconstruct the input dataset. We’ll review the concept of reconstruction in the next section.

Contrastive Divergence

RBMs calculate gradients by using an algorithm called contrastive divergence, the sampling algorithm used in the layer-wise pretraining of an RBM. Also called CD-k, contrastive divergence minimizes the KL divergence (the delta between the real distribution of the data and the guess) by sampling k steps of a Markov chain to compute a guess.
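To ground the idea, here is a hedged NumPy sketch of a single CD-1 update for a small Bernoulli RBM. The dataset, layer sizes, and learning rate are invented for illustration; this is not DL4J’s implementation:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(v0, W, b_visible, b_hidden, learning_rate=0.1):
        # Positive phase: sample hidden units given the data
        h0_prob = sigmoid(v0 @ W + b_hidden)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)

        # Negative phase: one Gibbs step to get the "reconstruction"
        v1_prob = sigmoid(h0 @ W.T + b_visible)
        h1_prob = sigmoid(v1_prob @ W + b_hidden)

        # Approximate gradient: data statistics minus reconstruction statistics
        batch = v0.shape[0]
        W += learning_rate * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
        b_visible += learning_rate * (v0 - v1_prob).mean(axis=0)
        b_hidden += learning_rate * (h0_prob - h1_prob).mean(axis=0)
        return W, b_visible, b_hidden

    # Tiny made-up binary dataset: 6 visible units, 4 hidden units
    data = rng.integers(0, 2, size=(100, 6)).astype(float)
    W = 0.01 * rng.normal(size=(6, 4))
    b_v, b_h = np.zeros(6), np.zeros(4)
    for epoch in range(50):
        W, b_v, b_h = cd1_update(data, W, b_v, b_h)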

Reconstruction

Deep neural networks with unsupervised pretraining (RBMs, autoencoders) perform feature engineering from unlabeled data through reconstruction. In pretraining, the weights learned through unsupervised learning are used to initialize the weights of networks such as Deep Belief Networks.

Reconstruction as Matrix Factorization

Reconstruction is a matrix factorization problem (also known as matrix decomposition).

Figure 2-4 presents a visual explanation of the network involved in reconstruction in RBMs.

Figure 2-4. Reconstruction in RBMs

We can visually explain reconstruction in RBMs by looking at the MNIST dataset. MNIST stands for Mixed National Institute of Standards and Technology; the dataset is a collection of images representing the handwritten numerals 0 through 9. Figure 2-5 depicts a sample of some of the handwritten digits in MNIST.

Figure 2-5. Sample of MNIST digits

The training dataset in MNIST has 60,000 records and the test dataset has 10,000 records. If we use an RBM to learn the MNIST dataset, we can sample from the trained network to see19 how well it can reconstruct the digits. Figure 2-6 shows renderings of MNIST digits being progressively reconstructed with an RBM.

Figure 2-6. Reconstructing MNIST digits with RBMs

If the training data has a normal distribution, most of the data points cluster around a central mean, or average, and become scarcer the farther you stray from that average; the distribution looks like a bell curve. If we know the mean and the variance, or sigma, of normal data, we can reconstruct that curve. But suppose that we don’t know the mean and variance. Those are parameters we then need to guess. Picking them randomly and contrasting the curve they produce with the original data can operate similarly to a loss function. We measure the difference between two probability distributions much like we measure erroneous classifications, adjust our parameters, and try again.

Reconstruction Cross-Entropy

The objective function here is usually reconstruction cross-entropy, or KL divergence (the mathematicians and cryptanalysts Solomon Kullback and Richard Leibler first published a paper on the technique in 1951). “Cross” refers to the comparison between two distributions. “Entropy” is a term from information theory that refers to uncertainty. For example, a normal curve with a wide spread, or variance, also implies more uncertainty about where data points will fall. That uncertainty is called entropy.
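For reference, for two discrete distributions P and Q, the KL divergence and the closely related cross-entropy are usually written as follows:

    D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}

    H(P, Q) = -\sum_{i} P(i) \log Q(i) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)

Minimizing the cross-entropy H(P, Q) with respect to our guess Q therefore also minimizes the KL divergence, because the entropy H(P) of the data distribution is fixed.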

Other uses of RBMs

Here are some other places we see RBMs used:

  • Dimensionality reduction
  • Classification
  • Regression
  • Collaborative filtering
  • Topic modeling

Autoencoders

We use autoencoders to learn compressed representations of datasets. Typically, we use them to reduce a dataset’s dimensionality. The output of the autoencoder network is a reconstruction of the input data in the most efficient form.

Similarities to multilayer perceptrons

Autoencoders bear a strong resemblance to multilayer perceptron neural networks in that they have an input layer, hidden layers of neurons, and an output layer. The key difference between a multilayer perceptron network diagram (from earlier chapters) and an autoencoder diagram is that the output layer in an autoencoder has the same number of units as the input layer does.

Figure 2-7 presents an example of an autoencoder network.

Figure 2-7. Autoencoder network architecture

Beyond the output layer, there are a few other differences, which we outline in the next section.

Defining features of autoencoders

Autoencoders differ from multilayer perceptrons in a couple of ways:

  • They use unlabeled data in unsupervised learning.
  • They build a compressed representation of the input data.

Unsupervised learning of unlabeled data

The autoencoder learns directly from unlabeled data. This is connected to the second major difference between multilayer perceptrons and autoencoders.

Learning to reproduce the input data

The goal of a multilayer perceptron network is to generate predictions over a class (e.g., fraud versus not fraud). An autoencoder is trained to reproduce its own input data.

Training autoencoders

Autoencoders rely on backpropagation to update their weights. The main difference between RBMs and the more general class of autoencoders is in how they calculate the gradients.
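To make the training procedure concrete, here is a hedged NumPy sketch of a tiny autoencoder with a single bottleneck layer, trained by backpropagation to reproduce its input; the data, layer sizes, and learning rate are all made up:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Made-up data: 8-dimensional inputs squeezed through a 3-unit bottleneck
    X = rng.random((200, 8))
    W_enc = 0.1 * rng.normal(size=(8, 3))
    W_dec = 0.1 * rng.normal(size=(3, 8))
    b_enc, b_dec = np.zeros(3), np.zeros(8)
    learning_rate = 0.5

    for epoch in range(2000):
        # Forward pass: encode, then decode back to the input dimensionality
        hidden = sigmoid(X @ W_enc + b_enc)
        output = sigmoid(hidden @ W_dec + b_dec)

        # Reconstruction error: the target is the input itself
        error = output - X

        # Backpropagation through the two sigmoid layers
        grad_out = error * output * (1 - output)
        grad_hidden = (grad_out @ W_dec.T) * hidden * (1 - hidden)

        W_dec -= learning_rate * hidden.T @ grad_out / len(X)
        b_dec -= learning_rate * grad_out.mean(axis=0)
        W_enc -= learning_rate * X.T @ grad_hidden / len(X)
        b_enc -= learning_rate * grad_hidden.mean(axis=0)

    print(np.mean(error ** 2))  # reconstruction error shrinks during training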

Common variants of autoencoders

Two important variants of autoencoders to note are compression autoencoders and denoising autoencoders.

Compression autoencoders

This is the architecture depicted in Figure 2-7. The network input must pass through a bottleneck region of the network before being expanded back into the output representation.

Denoising autoencoders

The denoising autoencoder20 is the scenario in which the autoencoder is given a corrupted version (e.g., some features are removed randomly) of the input and the network is forced to learn the uncorrupted output.
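A minimal sketch of the corruption step follows, assuming masking noise (randomly zeroed features); the corrupted version is fed to the network while the clean version serves as the reconstruction target, as in the training loop sketched earlier:

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt(X, drop_fraction=0.3):
        # Masking noise: randomly zero out a fraction of the input features
        mask = rng.random(X.shape) > drop_fraction
        return X * mask

    X_clean = rng.random((200, 8))   # made-up clean inputs
    X_noisy = corrupt(X_clean)
    # Train: feed X_noisy forward, compute the loss against X_clean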

Applications of autoencoders

Building a model to represent the input dataset might not sound useful on the surface. However, we’re less interested in the output itself and more interested in the difference between the input and output representations. If we can train a neural network to learn data it commonly “sees,” then this network can also let us know when it’s “seeing” data that is unusual, or anomalous.

Autoencoders as Anomaly Detectors

Autoencoders are commonly used in systems in which we know what the normal data will look like, yet it’s difficult to describe what is anomalous. Autoencoders are good at powering anomaly detection systems.

Variational Autoencoders

A more recent type of autoencoder model is the variational autoencoder (VAE) introduced by Kingma and Welling21 (see Figure 2-8). The VAE is similar to compression and denoising autoencoders in that they are all trained in an unsupervised manner to reconstruct inputs.

However, the mechanisms that the VAEs use to perform training are quite different. In a compression/denoising autoencoder, activations are mapped to activations throughout the layers, as in a standard neural network; comparatively, a VAE uses a probabilistic approach for the forward pass.

Figure 2-8. VAE network architecture

The VAE model assumes that the data x is generated in two steps: (a) a value z is sampled from a prior distribution p(z), and (b) the data instance is generated according to some conditional distribution p(x|z). Of course, we don’t actually know the values of z, and inferring p(z|x) exactly is generally intractable. To handle this, we let both distributions, p(z|x) and p(x|z), be approximated by neural networks—the encoder and decoder, respectively. For example, if p(z|x) is Gaussian, the encoder forward-pass activations provide the Gaussian distribution parameters μ and σ2.

Similarly, the distribution parameters for p(x|z) are provided by the decoder forward pass.22 Overall, the network is trained by backpropagation to maximize a lower bound on the marginal likelihood of the training data, log p(x1, ..., xN). The VAE model also has been extended to allow unsupervised learning on time series with the variational recurrent autoencoder.23
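For reference, the lower bound that the network maximizes (often called the evidence lower bound, or ELBO) can be written per training example x as follows, where q(z|x) is the encoder’s approximation to the intractable p(z|x):

    \log p(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\left[ \log p(x \mid z) \right] - D_{\mathrm{KL}}\left( q(z \mid x) \,\|\, p(z) \right)

The first term rewards accurate reconstruction through the decoder, and the KL term keeps the encoder’s distribution close to the prior.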

1 Referring to Lewis Carroll’s Red Queen character from the book Through the Looking Glass; she has to keep running to stay in the same place.

2 Mnih et al. 2015. “Human-level control through deep reinforcement learning.”

3 LeCun et al. 1998. “Gradient-based learning applied to document recognition.”

4 Krizhevsky, Sutskever, and Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.”

5 Hinton, Osindero, and Teh. 2006. “A Fast Learning Algorithm for Deep Belief Nets.”

6 Gatys et al., 2015. “A Neural Algorithm of Artistic Style.”

7 Goodfellow et al. 2014. “Generative Adversarial Networks.”

8 Source: IEEE Spectrum.

9 Le et al. 2011. “On Optimization Methods for Deep Learning.”

10 LeCun et al. 1998. “Efficient BackProp.”

11 Martens. 2010. “Deep learning via Hessian-free optimization.”

12 Duchi, Hazan, and Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.”

13 Zeiler. 2012. “ADADELTA: An Adaptive Learning Rate Method.”

14 Bengio et al. 2015. Deep Learning (In Preparation).

15 Ibid.

16 Bengio. 2012. “Practical recommendations for gradient-based training of deep architectures.”

17 Bengio et al. 2007. “Greedy Layer-Wise Training of Deep Networks.”

18 Krizhevsky, Sutskever, and Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.”

19 Yosinksi and Lipson. 2012. “Visually Debugging Restricted Boltzmann Machine Training with a 3D Example.”

20 Vincent et al. 2010. “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.”

21 Kingma and Welling. 2013. “Auto-Encoding Variational Bayes.”

22 These distribution parameters should not be confused with the trainable network parameters: in practice, they are just network activations used to specify (for example) the mean and variance values for a Gaussian distribution, or the mean value for Bernoulli distribution.

23 Fabius and van Amersfoort. 2014. “Variational Recurrent Auto-Encoders.”
