Demystifying neural networks

A simple neural network is composed of three layers: the input layer, the hidden layer, and the output layer, as shown in the following diagram:

A layer is a conceptual collection of nodes (also called units), which simulate neurons in a biological brain. The input layer represents the input features x, with each node corresponding to one predictive feature. The output layer represents the target variable(s). In binary classification, the output layer contains only one node, whose value is the probability of the positive class. In multiclass classification, the output layer consists of n nodes, where n is the number of possible classes, and the value of each node is the probability of predicting that class. In regression, the output layer contains only one node, whose value is the prediction result. The hidden layer can be considered a composition of latent information extracted from the previous layer. There can be more than one hidden layer. Learning with a neural network that has two or more hidden layers is called DL. We will focus on one hidden layer to begin with.

Two adjacent layers are connected by conceptual edges, somewhat like the synapses in a biological brain, which transmit signals from one neuron in a layer to another neuron in the next layer. The edges are parameterized by the weights W of the model. For example, W(1) in the preceding diagram connects the input and hidden layers, and W(2) connects the hidden and output layers.

In a standard neural network, data are conveyed only from the input layer to the output layer, through the hidden layer(s). Hence, this kind of network is called a feed-forward neural network. Basically, logistic regression is a feed-forward neural network with no hidden layer, where the output layer connects directly with the input layer. Neural networks with one or more hidden layers between the input and output layers should be able to learn more about the underlying relationship between the input data and the target.

Suppose the input x is n-dimensional and the hidden layer is composed of H hidden units. The weight matrix W(1) connecting the input and hidden layers is then of size n by H, where the h-th column represents the coefficients associating the input with the h-th hidden unit. The output (also called activation) of the hidden layer can be expressed mathematically as follows:
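Written out, with any bias term omitted for simplicity (it can be absorbed into the weights), this is:

a^{(2)} = f\left(z^{(2)}\right) = f\left({W^{(1)}}^{T} x\right)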

Here, f(z) is an activation function. As its name implies, the activation function checks how activated each neuron is, simulating the way our brains work. Typical activation functions include the logistic function (more often called the sigmoid function in neural networks) and the tanh function, which is considered a rescaled version of the logistic function, as well as ReLU (short for Rectified Linear Unit), which is often used in DL:
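Written out, the three functions are:

f(z) = \frac{1}{1 + e^{-z}} \quad \text{(logistic/sigmoid)}

f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\,\mathrm{sigmoid}(2z) - 1 \quad \text{(tanh)}

f(z) = \max(0, z) \quad \text{(ReLU)}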

We plot the following three activation functions as follows:

  • The logistic (sigmoid) function plot is as follows: 

  • The tanh function plot is as follows:

  • The ReLU function plot is as follows:
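The plots can be reproduced with a short script; the following is a minimal sketch, assuming NumPy and Matplotlib are available (the variable names are chosen here purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-8, 8, 1000)

# Compute the three activation functions over the same input range
sigmoid = 1 / (1 + np.exp(-z))
tanh = np.tanh(z)
relu = np.maximum(0, z)

# Draw one panel per activation function
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, values, title in zip(axes, [sigmoid, tanh, relu],
                             ['logistic (sigmoid)', 'tanh', 'ReLU']):
    ax.plot(z, values)
    ax.set_title(title)
    ax.set_xlabel('z')
    ax.grid(True)
plt.tight_layout()
plt.show()
```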

As for the output layer, let's assume there's one output unit (regression or binary classification) and that the weight matrix W(2) connecting the hidden layer to the output layer is of size H by 1. In regression, the output can be expressed mathematically as follows (for consistency, we denote it here as a(3) instead of y):
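Written out, again leaving out a separate bias term, the regression output is:

a^{(3)} = z^{(3)} = {W^{(2)}}^{T} a^{(2)}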

So, how can we obtain the optimal weights W = {W(1), W(2)} of the model? Similar to logistic regression, we learn all the weights using gradient descent with the goal of minimizing the MSE cost J(W). The difference is that the gradients ΔW are computed through backpropagation. In a network with a single hidden layer, the detailed steps of backpropagation are as follows:

  1. We travel through the network from the input to the output and compute the output values a(2) of the hidden layer as well as the output values a(3) of the output layer. This is the feedforward step.
  2. For the last layer, we calculate the derivative of the cost function with respect to the input to the output layer.
  3. For the hidden layer, we compute the derivative of the cost function with respect to the input to the hidden layer.
  4. We compute the gradients by applying the chain rule.
  5. We update the weights with the computed gradients and the learning rate α.
  6. We repeatedly update all the weights by taking these steps with the latest weights until the cost function converges or enough iterations have been run.

The equations behind steps 2 to 5 are written out right after this list.
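Under the assumptions used so far (an MSE cost J(W) over m training samples, a sigmoid activation f in the hidden layer, and a linear output unit for regression), one way to write the quantities in steps 2 to 5 is shown below; the per-sample derivatives are given, and in practice the gradients are averaged over the m samples before each update:

J(W) = \frac{1}{2m} \sum_{i=1}^{m} \left( a^{(3)}_{i} - y_{i} \right)^{2}

\delta^{(3)} = \frac{\partial J}{\partial z^{(3)}} = a^{(3)} - y

\delta^{(2)} = \frac{\partial J}{\partial z^{(2)}} = \left( W^{(2)} \delta^{(3)} \right) \odot f'\left( z^{(2)} \right), \qquad f'(z) = f(z)\left(1 - f(z)\right)

\Delta W^{(2)} = \frac{\partial J}{\partial W^{(2)}} = a^{(2)} \delta^{(3)}, \qquad \Delta W^{(1)} = \frac{\partial J}{\partial W^{(1)}} = x \, {\delta^{(2)}}^{T}

W^{(2)} := W^{(2)} - \alpha \, \Delta W^{(2)}, \qquad W^{(1)} := W^{(1)} - \alpha \, \Delta W^{(1)}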

This might not be easy to digest at first glance, so let's implement it from scratch, which will help you to understand neural networks better.
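As a preview of that implementation, here is a minimal sketch of the whole procedure in NumPy: a one-hidden-layer network with sigmoid hidden units, a linear output unit, an MSE cost, and plain gradient descent. Bias terms are omitted to mirror the equations above, and all names (train, predict, and so on) are chosen here purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, n_hidden=20, learning_rate=0.1, n_iter=1000):
    """Train a one-hidden-layer network with sigmoid hidden units,
    a linear output unit, and an MSE cost, using backpropagation."""
    m, n_features = X.shape
    W1 = np.random.randn(n_features, n_hidden) * 0.01  # input -> hidden
    W2 = np.random.randn(n_hidden, 1) * 0.01           # hidden -> output
    for _ in range(n_iter):
        # Feedforward step
        Z2 = X @ W1           # (m, n_hidden)
        A2 = sigmoid(Z2)      # hidden-layer activations a(2)
        A3 = A2 @ W2          # (m, 1) linear output a(3) for regression
        # Backpropagation
        delta3 = A3 - y                            # dJ/dz(3) per sample
        delta2 = (delta3 @ W2.T) * A2 * (1 - A2)   # dJ/dz(2) per sample
        grad_W2 = A2.T @ delta3 / m                # averaged gradients
        grad_W1 = X.T @ delta2 / m
        # Gradient descent update
        W2 -= learning_rate * grad_W2
        W1 -= learning_rate * grad_W1
    return W1, W2

def predict(X, W1, W2):
    return sigmoid(X @ W1) @ W2

# Tiny usage example on synthetic regression data
X = np.random.rand(100, 3)
y = X @ np.array([[1.0], [2.0], [3.0]])  # (100, 1) target
W1, W2 = train(X, y)
print(predict(X[:5], W1, W2))
```

With real data, you would also normalize the inputs and track the cost J(W) over iterations to verify convergence.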
