Deep belief networks and deep learning

Some of the pioneering advancements in neural network research over the last decade have opened up a new frontier in machine learning generally known as deep learning (references 5 and 7 in the References section of this chapter). Broadly, deep learning is a class of machine learning techniques in which many layers of information processing stages, arranged in hierarchical architectures, are exploited for unsupervised feature learning and for pattern analysis/classification. The essence of deep learning is to compute hierarchical features or representations of the observational data, where higher-level features or factors are defined from lower-level ones (reference 8 in the References section of this chapter). Although there are many similar definitions and architectures for deep learning, two elements are common to all of them: multiple layers of nonlinear information processing, and supervised or unsupervised learning of feature representations at each layer from the features learned at the previous layer. The initial work on deep learning was based on multilayer neural network models. Recently, many other forms of models have also been used, such as deep kernel machines and deep Q-networks.

Even in previous decades, researchers had experimented with multilayer neural networks. However, two issues limited progress in learning with such architectures. The first is that learning the network parameters is a non-convex optimization problem: starting from random initial conditions, the minimization of the error gets stuck in poor local minima. The second is that the associated computational requirements were huge. A breakthrough for the first problem came when Geoffrey Hinton developed a fast algorithm for learning a special class of neural networks called deep belief nets (DBNs). We will describe DBNs in more detail in later sections. The high computational power requirements were met through advances in computing on general-purpose graphics processing units (GPGPUs). What made deep learning so popular for practical applications is the significant improvement in accuracy it achieved in automatic speech recognition and computer vision. For example, the word error rate in automatic speech recognition of Switchboard conversational speech had saturated at around 40% after years of research.

However, using deep learning, the word error rate dropped dramatically to close to 10% within a few years. Another well-known example is the 2012 ImageNet Large Scale Visual Recognition Challenge, in which a deep convolutional neural network achieved an error rate of 15.3%, compared to the 26.2% achieved by the best of the other state-of-the-art methods (reference 7 in the References section of this chapter).

In this chapter, we will describe one class of deep learning models called deep belief networks. Interested readers may wish to read the book by Li Deng and Dong Yu (reference 9 in the References section of this chapter) for a detailed treatment of the various methods and applications of deep learning. We will follow their notation in the rest of this chapter. We will also illustrate the use of DBNs with the R package darch.

Restricted Boltzmann machines

A restricted Boltzmann machine (RBM) is a two-layer network (a bipartite graph) in which one layer is a visible layer (v) and the second layer is a hidden layer (h). Every node in the visible layer is connected to every node in the hidden layer by an undirected edge, and there are no connections between nodes within the same layer:

(Figure: a restricted Boltzmann machine, shown as a bipartite graph of visible and hidden units)

An RBM is characterized by the joint distribution of the states of all visible units $v = (v_1, \ldots, v_M)$ and the states of all hidden units $h = (h_1, \ldots, h_N)$, given by:

$$P(v, h) = \frac{1}{Z} \exp\bigl(-E(v, h)\bigr)$$

Here, $E(v, h)$ is called the energy function and $Z = \sum_{v, h} \exp\bigl(-E(v, h)\bigr)$ is the normalization constant, known as the partition function in statistical physics nomenclature.

There are mainly two types of RBMs. In the first type, both v and h are Bernoulli random variables. In the second type, h is a Bernoulli random variable whereas v is a Gaussian random variable. For the Bernoulli RBM, the energy function is given by:

$$E(v, h) = -\sum_{i=1}^{M}\sum_{j=1}^{N} w_{ij} v_i h_j - \sum_{i=1}^{M} b_i v_i - \sum_{j=1}^{N} a_j h_j$$

Here, $w_{ij}$ represents the weight of the edge between nodes $v_i$ and $h_j$; $b_i$ and $a_j$ are the bias parameters of the visible and hidden layers, respectively. For this energy function, the exact expressions for the conditional probabilities can be derived as follows:

$$P(h_j = 1 \mid v) = \sigma\Bigl(\sum_{i=1}^{M} w_{ij} v_i + a_j\Bigr)$$

$$P(v_i = 1 \mid h) = \sigma\Bigl(\sum_{j=1}^{N} w_{ij} h_j + b_i\Bigr)$$

Here, $\sigma(x)$ is the logistic function $1/(1 + e^{-x})$.
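To make the notation concrete, here is a minimal sketch in R (not taken from the book or from any package; the weights W and biases b and a are made-up values for a small network of three visible and two hidden units) that evaluates the energy function and the conditional probability of the hidden units given a visible vector:

# Minimal illustrative sketch: Bernoulli RBM energy and P(h_j = 1 | v).
# W, b, a are made-up parameters, not values produced by any package.
set.seed(42)
M <- 3; N <- 2                              # numbers of visible and hidden units
W <- matrix(rnorm(M * N, sd = 0.1), M, N)   # weights w_ij
b <- rep(0, M)                              # visible biases
a <- rep(0, N)                              # hidden biases

sigmoid <- function(x) 1 / (1 + exp(-x))    # logistic function

# Energy of a joint configuration (v, h)
energy <- function(v, h) {
  -as.numeric(t(v) %*% W %*% h) - sum(b * v) - sum(a * h)
}

# Conditional probabilities of all hidden units given a visible vector
p_h_given_v <- function(v) sigmoid(as.vector(t(W) %*% v) + a)

v <- c(1, 0, 1)
p_h_given_v(v)        # P(h_j = 1 | v) for j = 1, 2
energy(v, c(1, 0))    # energy of one (v, h) configuration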

If the input variables are continuous, one can use the Gaussian RBM, whose energy function is given by:

$$E(v, h) = -\sum_{i=1}^{M}\sum_{j=1}^{N} w_{ij} v_i h_j + \frac{1}{2}\sum_{i=1}^{M} (v_i - b_i)^2 - \sum_{j=1}^{N} a_j h_j$$

In this case, the conditional probabilities of $h_j$ and $v_i$ become:

$$P(h_j = 1 \mid v) = \sigma\Bigl(\sum_{i=1}^{M} w_{ij} v_i + a_j\Bigr)$$

$$P(v_i \mid h) = \mathcal{N}\Bigl(\sum_{j=1}^{N} w_{ij} h_j + b_i,\; 1\Bigr)$$

The latter is a normal distribution with mean $\sum_{j=1}^{N} w_{ij} h_j + b_i$ and variance 1.
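As a small illustration (again a sketch with made-up parameters, not code from any package), the following R lines sample the continuous visible units of a Gaussian RBM given a hidden vector, using the conditional normal distribution above:

# Sketch: sampling v | h for a Gaussian RBM with unit variance.
set.seed(1)
M <- 3; N <- 2
W <- matrix(rnorm(M * N, sd = 0.1), M, N)   # made-up weights
b <- rep(0, M)                              # visible biases

sample_v_given_h <- function(h) {
  mean_v <- as.vector(W %*% h) + b          # mean of P(v_i | h)
  rnorm(length(mean_v), mean = mean_v, sd = 1)
}

sample_v_given_h(c(1, 0))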

Now that we have described the basic architecture of an RBM, how is it trained? If we try the standard approach of taking the gradient of the log-likelihood, we get the following update rule for the weights:

$$\Delta w_{ij} = \eta \bigl( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \bigr)$$

Here, $\eta$ is the learning rate, $\langle v_i h_j \rangle_{\text{data}}$ is the expectation of $v_i h_j$ computed using the dataset, and $\langle v_i h_j \rangle_{\text{model}}$ is the same expectation computed under the model. However, one cannot use this exact expression for updating the weights because $\langle v_i h_j \rangle_{\text{model}}$ is difficult to compute.

The breakthrough that solved this problem, and hence made it possible to train deep neural networks, came when Hinton and his team proposed an algorithm called Contrastive Divergence (CD) (reference 7 in the References section of this chapter). The essence of the algorithm is described in the next paragraph.

The idea is to approximate $\langle v_i h_j \rangle_{\text{model}}$ using values of $v_i$ and $h_j$ generated through Gibbs sampling from the conditional distributions mentioned previously. One scheme for doing this is as follows:

  1. Initialize $v^{(0)}$ from the dataset.
  2. Find $h^{(0)}$ by sampling from the conditional distribution $P\bigl(h \mid v^{(0)}\bigr)$.
  3. Find $v^{(1)}$ by sampling from the conditional distribution $P\bigl(v \mid h^{(0)}\bigr)$.
  4. Find $h^{(1)}$ by sampling from the conditional distribution $P\bigl(h \mid v^{(1)}\bigr)$.

Once we have the values of $v^{(1)}$ and $h^{(1)}$, we use $v_i^{(1)} h_j^{(1)}$, the product of the ith component of $v^{(1)}$ and the jth component of $h^{(1)}$, as an approximation for $\langle v_i h_j \rangle_{\text{model}}$. This is called the CD-1 algorithm. One can generalize this to use the values from the kth step of Gibbs sampling; this is known as the CD-k algorithm. One can also see a connection between RBMs and Bayesian inference: since the CD algorithm amounts to a form of approximate posterior density estimation, one could say that RBMs are trained using a Bayesian inference approach.
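The following is a minimal, self-contained sketch in R of a single CD-1 update for a Bernoulli RBM. It only illustrates the four steps above with made-up parameters and a learning rate eta; it is not how darch implements CD internally:

# Sketch of one CD-1 parameter update for a Bernoulli RBM.
set.seed(7)
M <- 2; N <- 4; eta <- 0.1                  # layer sizes and learning rate
W <- matrix(rnorm(M * N, sd = 0.1), M, N)   # made-up initial weights
b <- rep(0, M); a <- rep(0, N)              # visible and hidden biases
sigmoid <- function(x) 1 / (1 + exp(-x))

cd1_update <- function(v0, W, b, a, eta) {
  # Step 2: sample h0 from P(h | v0)
  ph0 <- sigmoid(as.vector(t(W) %*% v0) + a)
  h0  <- rbinom(length(ph0), 1, ph0)
  # Step 3: sample v1 from P(v | h0)
  pv1 <- sigmoid(as.vector(W %*% h0) + b)
  v1  <- rbinom(length(pv1), 1, pv1)
  # Step 4: sample h1 from P(h | v1)
  ph1 <- sigmoid(as.vector(t(W) %*% v1) + a)
  h1  <- rbinom(length(ph1), 1, ph1)
  # Approximate gradient: <v h>_data - <v h>_model
  dW <- outer(v0, h0) - outer(v1, h1)
  list(W = W + eta * dW,
       b = b + eta * (v0 - v1),
       a = a + eta * (h0 - h1))
}

params <- cd1_update(c(1, 0), W, b, a, eta)   # one update from one training vector

In practice, implementations often use the conditional probabilities rather than the sampled binary values in the gradient term to reduce sampling noise; this and other practical choices are discussed in the technical report cited below.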

Although the Contrastive Divergence algorithm looks simple, one needs to be very careful when training RBMs; otherwise, the model can overfit. Readers who are interested in using RBMs in practical applications should refer to the technical report (reference 10 in the References section of this chapter), where this is discussed in detail.

Deep belief networks

One can stack several RBMs, one on top of another, such that the activation values of the hidden units of the (n − 1)th RBM become the values of the visible units of the nth RBM, and so on. The resulting network is called a deep belief network. It was one of the main architectures used in early deep learning networks for pretraining. The idea of pretraining a NN is the following: in the standard three-layer (input-hidden-output) NN, one can start with random initial values for the weights and, using the backpropagation algorithm, find a good minimum of the log-likelihood function. However, when the number of layers increases, a straightforward application of backpropagation does not work: as we compute the gradient values for the layers deeper inside the network, starting from the output layer, their magnitude becomes very small. This is called the vanishing gradient problem. As a result, the network gets trapped in poor local minima. Backpropagation still works if we start from the neighborhood of a good minimum. To achieve this, a DNN is often pretrained in an unsupervised way using a DBN: instead of starting from random weight values, one trains a DBN in an unsupervised way and uses its weights as the initial weights of the corresponding supervised DNN. It has been observed that DNNs pretrained in this way perform much better (reference 8 in the References section of this chapter).

The layer-wise pretraining of a DBN proceeds as follows (a conceptual sketch is given after this paragraph). Start with the first RBM and train it using the input data in its visible layer and the CD algorithm (or one of its more recent, improved variants). Then, stack a second RBM on top of it; for this RBM, use the values sampled from the hidden units of the first RBM as the values of its visible layer. Continue this process for the desired number of layers. The outputs of the hidden units of the top layer can then be used as inputs for training a supervised model. For this, add a conventional NN layer on top of the DBN, with the desired number of classes as the number of output nodes; the input to this NN is the output of the top layer of the DBN. This is called the DBN-DNN architecture. Here, the DBN's role is to automatically generate highly informative features (the output of its top layer) from the input data for the supervised NN on top.
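As a conceptual sketch (not darch code), the following R function shows the shape of greedy layer-wise pretraining; train_rbm is a hypothetical helper standing in for any CD-based RBM trainer, such as repeated applications of the CD-1 update sketched earlier:

# Conceptual sketch of greedy layer-wise pretraining of a DBN.
sigmoid <- function(x) 1 / (1 + exp(-x))

pretrain_dbn <- function(data, layer_sizes, train_rbm) {
  rbms  <- list()
  input <- data                             # rows = samples, columns = units
  for (k in seq_along(layer_sizes)) {
    # train_rbm (hypothetical) returns a list with weights W and hidden biases a
    rbm <- train_rbm(input, n_hidden = layer_sizes[k])
    # Hidden-unit activations of this RBM become the input of the next one
    input <- sigmoid(input %*% rbm$W +
                     matrix(rbm$a, nrow(input), ncol(rbm$W), byrow = TRUE))
    rbms[[k]] <- rbm
  }
  rbms    # these weights can initialize the corresponding supervised DNN
}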

The architecture of a five-layer DBN-DNN for a binary classification task is shown in the following figure:

(Figure: a five-layer DBN-DNN architecture for binary classification)

The last layer is trained in a supervised manner, using the backpropagation algorithm, for the two output classes. We will illustrate training and classification with such a DBN-DNN using the darch R package.

The darch R package

The darch package, written by Martin Drees, is one of the R packages with which one can get started with deep learning in R. It implements the DBN described in the previous section (references 5 and 7 in the References section of this chapter). The package can be downloaded from https://cran.r-project.org/web/packages/darch/index.html.

The main class in the darch package implements deep architectures and provides the ability to train them with Contrastive Divergence and to fine-tune them with backpropagation, resilient backpropagation, and conjugate gradients. New instances of the class are created with the newDArch constructor, which takes the following arguments: a vector containing the number of nodes in each layer, the batch size, a Boolean variable indicating whether to use the ff package for storing the weights and outputs, and the name of the function used to generate the weight matrices. Let us create a network having two input units, four hidden units, and one output unit:

install.packages("darch") #one time
>library(darch)
>darch <- newDArch(c(2,4,1),batchSize = 2,genWeightFunc = generateWeights)
INFO [2015-07-19 18:50:29] Constructing a darch with 3 layers.
INFO [2015-07-19 18:50:29] Generating RBMs.
INFO [2015-07-19 18:50:29] Construct new RBM instance with 2 visible and 4 hidden units.
INFO [2015-07-19 18:50:29] Construct new RBM instance with 4 visible and 1 hidden units.

Let us train the DBN with a toy dataset. We use a toy example because training any realistic example would take a long time: hours, if not days. Let us create an input dataset containing two columns and four rows:

>inputs <- matrix(c(0,0,0,1,1,0,1,1),ncol=2,byrow=TRUE)
>outputs <- matrix(c(0,1,1,0),nrow=4)

Now, let us pretrain the DBN, using the input data:

>darch <- preTrainDArch(darch,inputs,maxEpoch=1000)

We can have a look at the weights learned at any layer using the getLayerWeights() function. Let us look at the weights connecting the input layer to the hidden layer:

>getLayerWeights(darch,index=1)
[[1]]
          [,1]        [,2]       [,3]       [,4]
[1,]   8.167022    0.4874743  -7.563470  -6.951426
[2,]   2.024671  -10.7012389   1.313231   1.070006
[3,]  -5.391781    5.5878931   3.254914   3.000914

Now, let's do backpropagation for supervised learning. For this, we first need to set the layer functions to sigmoidUnitDerivative:

>layers <- getLayers(darch)
>for(i in length(layers):1){
     layers[[i]][[2]] <- sigmoidUnitDerivative
    }
>setLayers(darch) <- layers
>rm(layers)

Finally, the following two lines perform the backpropagation:

>setFineTuneFunction(darch) <- backpropagation
>darch <- fineTuneDArch(darch,inputs,outputs,maxEpoch=1000)

We can see the prediction quality of the DBN on the training data itself by running darch as follows:

>darch <- getExecuteFunction(darch)(darch,inputs)
>outputs_darch <- getExecOutputs(darch)
>outputs_darch[[2]]
        [,1]
[1,] 9.998474e-01
[2,] 4.921130e-05
[3,] 9.997649e-01
[4,] 3.796699e-05

Comparing with the actual outputs, the DBN has predicted the wrong output for the first and second input rows. Since this example was meant only to illustrate how to use the darch package, we are not worried about the 50% accuracy here.
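As a quick sanity check (assuming the objects created in the session above are still available), the training accuracy can be computed by rounding the predictions and comparing them with the true outputs:

>mean(round(outputs_darch[[2]]) == outputs)
[1] 0.5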

Other deep learning packages in R

Although there are other deep learning packages in R, such as deepnet and RcppDL, compared with libraries in other languages, such as CUDA-based C++ frameworks and Theano (Python), R does not yet have good native libraries for deep learning. For serious work, the main option is the h2o package, a wrapper for the Java-based open source deep learning project H2O; it runs H2O via its REST API from within R. Readers who are interested in serious deep learning projects and applications should consider using H2O through the h2o package in R. Note that H2O needs to be installed on your machine in order to use the package. We will cover H2O in the next chapter, when we discuss Big Data and the distributed computing platform Spark.
