Choosing activation functions for feedforward neural networks

For simplicity, we have only discussed the sigmoid activation function in the context of multilayer feedforward neural networks so far; we used it in both the hidden layer and the output layer of the multilayer perceptron implementation in Chapter 12, Training Artificial Neural Networks for Image Recognition. Although we referred to this activation function as the sigmoid function (as it is commonly called in the literature), the more precise term is logistic function, since "sigmoid" describes a whole family of S-shaped functions. In the following subsections, you will learn more about alternative sigmoidal functions that are useful for implementing multilayer neural networks.

Technically, we could use any function as an activation function in multilayer neural networks as long as it is differentiable. We could even use linear activation functions, such as in Adaline (Chapter 2, Training Machine Learning Algorithms for Classification). However, in practice, it would not be very useful to use linear activation functions for both hidden and output layers, since we want to introduce nonlinearity in a typical artificial neural network to be able to tackle complex problems. A composition of linear functions yields just another linear function, after all.
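To make this concrete, here is a minimal sketch (the names W1, W2, and x are our own, not part of the chapter's code) showing that two stacked linear layers compute exactly the same output as a single linear layer whose weight matrix is the product of the two:

>>> import numpy as np
>>> rng = np.random.RandomState(1)
>>> W1 = rng.randn(3, 2)    # "hidden" layer weights, no nonlinearity
>>> W2 = rng.randn(1, 3)    # output layer weights
>>> x = rng.randn(2, 1)     # an arbitrary input vector
>>> # two linear layers collapse into one linear layer W2.dot(W1)
>>> np.allclose(W2.dot(W1.dot(x)), (W2.dot(W1)).dot(x))
True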

The logistic activation function that we used in the previous chapter probably mimics the concept of a neuron in a brain most closely: we can think of its output as the probability that a neuron fires. However, logistic activation functions can be problematic if we have highly negative inputs, since the output of the sigmoid function would be close to zero in this case. If the sigmoid function returns outputs that are close to zero, the neural network learns very slowly, and it becomes more likely that the network gets trapped in local minima during training. This is why people often prefer a hyperbolic tangent as the activation function in hidden layers. Before we discuss what a hyperbolic tangent looks like, let's briefly recapitulate some of the basics of the logistic function and look at a generalization that makes it more useful for multi-class classification tasks.
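To see this saturation effect concretely, the following short sketch (our own illustration; it uses the same logistic implementation we define later in this section) evaluates the logistic output and its derivative, $\phi(z)(1 - \phi(z))$, at a moderate and a highly negative net input. Since the weight updates in backpropagation are proportional to this derivative, a near-zero gradient means learning stalls:

>>> import numpy as np
>>> def logistic(z):
...     return 1.0 / (1.0 + np.exp(-z))
>>> for z in [0.0, -10.0]:
...     phi = logistic(z)
...     # gradient of the logistic function: phi * (1 - phi)
...     print('z=%6.1f output=%.6f gradient=%.6f'
...           % (z, phi, phi * (1.0 - phi)))
z=   0.0 output=0.500000 gradient=0.250000
z= -10.0 output=0.000045 gradient=0.000045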

Logistic function recap

As we mentioned in the introduction to this section, the logistic function, often just called the sigmoid function, is in fact a special case of a sigmoid function. We recall from the section on logistic regression in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that we can use the logistic function to model the probability that a sample $\mathbf{x}$ belongs to the positive class (class 1) in a binary classification task:

$$P(y=1 \mid \mathbf{x}) = \phi_{logistic}(z) = \frac{1}{1 + e^{-z}}$$

Here, the scalar variable $z$ is defined as the net input:

$$z = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \sum_{j=0}^{m} x_j w_j = \mathbf{w}^T \mathbf{x}$$

Note that $w_0$ is the bias unit (the y-axis intercept, which means $x_0 = 1$). To provide a more concrete example, let's assume a two-dimensional data point $\mathbf{x}$ and a model with the following weight coefficients assigned to the vector $\mathbf{w}$:

>>> import numpy as np
>>> # the first value in X corresponds to the bias unit (x0 = 1)
>>> X = np.array([[1, 1.4, 1.5]])
>>> w = np.array([0.0, 0.2, 0.4])

>>> def net_input(X, w):
...     z = X.dot(w)
...     return z

>>> def logistic(z):
...     return 1.0 / (1.0 + np.exp(-z))

>>> def logistic_activation(X, w):
...     z = net_input(X, w)
...     return logistic(z)

>>> print('P(y=1|x) = %.3f' 
...       % logistic_activation(X, w)[0])
P(y=1|x) = 0.707

If we calculate the net input and use it to activate a logistic neuron with those particular feature values and weight coefficients, we get back a value of 0.707, which we can interpret as a 70.7 percent probability that this particular sample $\mathbf{x}$ belongs to the positive class. In Chapter 12, Training Artificial Neural Networks for Image Recognition, we used the one-hot encoding technique to represent the class labels and computed the values in an output layer consisting of multiple logistic activation units. However, as we will demonstrate with the following code example, an output layer consisting of multiple logistic activation units does not produce meaningful, interpretable probability values:

# W : array, shape = [n_output_units, n_hidden_units+1]
#          Weight matrix for hidden layer -> output layer.
# note that the first column of W holds the weights for the bias unit (A[0] = 1)
>>> W = np.array([[1.1, 1.2, 1.3, 0.5],
...               [0.1, 0.2, 0.4, 0.1],
...               [0.2, 0.5, 2.1, 1.9]])

# A : array, shape = [n_hidden_units+1, n_samples]
#          Activation of hidden layer.
# note that first element (A[0][0] = 1) is the bias unit
>>> A = np.array([[1.0], 
...               [0.1], 
...               [0.3], 
...               [0.7]])

# Z : array, shape = [n_output_units, n_samples]
#          Net input of the output layer.
>>> Z = W.dot(A) 
>>> y_probas = logistic(Z)
>>> print('Probabilities:\n', y_probas)
Probabilities:
 [[ 0.87653295]
 [ 0.57688526]
 [ 0.90114393]]

As we can see in the output, the probability that the particular sample belongs to the first class is almost 88 percent, the probability that it belongs to the second class is almost 58 percent, and the probability that it belongs to the third class is 90 percent. This is clearly problematic, since the class-membership probabilities of mutually exclusive classes should sum to 1 (100 percent), whereas these three values add up to well over 2. However, this is in fact not a big concern if we only use our model to predict the class labels, not the class membership probabilities.
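The following quick check (our own addition, not part of the original example) confirms this by summing the three logistic outputs computed above:

>>> print('Sum: %.3f' % y_probas.sum())
Sum: 2.355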

>>> y_class = np.argmax(Z, axis=0)
>>> print('predicted class label: %d' % y_class[0])
predicted class label: 2

However, in certain contexts, it can be useful to return meaningful class probabilities for multi-class predictions. In the next section, we will take a look at a generalization of the logistic function, the softmax function, which can help us with this task.

Estimating probabilities in multi-class classification via the softmax function

The softmax function is a generalization of the logistic function that allows us to compute meaningful class probabilities in multi-class settings (multinomial logistic regression). In softmax, the probability that a particular sample with net input $z$ belongs to the $j$th class can be computed with a normalization term in the denominator, that is, the sum of all $M$ exponentiated net inputs:

$$P(y=j \mid z) = \phi_{softmax}(z) = \frac{e^{z_j}}{\sum_{k=1}^{M} e^{z_k}}$$

To see softmax in action, let's code it up in Python:

>>> def softmax(z): 
...     return np.exp(z) / np.sum(np.exp(z))

>>> def softmax_activation(X, w):
...     z = net_input(X, w)
...     return softmax(z)

>>> y_probas = softmax(Z)
>>> print('Probabilities:\n', y_probas)
Probabilities:
 [[ 0.40386493]
 [ 0.07756222]
 [ 0.51857284]]
>>> y_probas.sum()
1.0

As we can see, the predicted class probabilities now sum up to one, as we would expect. It is also notable that the probability for the second class is close to zero, since there is a large gap between its net input ($z_2 = 0.31$) and the two larger net inputs ($z_1 = 1.96$ and $z_3 = 2.21$). However, note that the predicted class label is the same as the one we obtained from the logistic function. Intuitively, it may help to think of the softmax function as a normalized logistic function that is useful for obtaining meaningful class-membership predictions in multi-class settings.

>>> y_class = np.argmax(Z, axis=0)
>>> print('predicted class label: %d' % y_class[0])
predicted class label: 2
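A practical side note that goes beyond the example above: np.exp overflows for large net inputs (roughly z > 700 in double precision), so the naive softmax implementation can return nan values. A common remedy, sketched here under the hypothetical name softmax_stable, is to subtract the maximum net input before exponentiating; the shift cancels out in the ratio, so the result is mathematically unchanged:

>>> def softmax_stable(z):
...     # subtracting max(z) cancels out in the ratio but
...     # prevents np.exp from overflowing to inf
...     e_z = np.exp(z - np.max(z))
...     return e_z / np.sum(e_z)

>>> z_big = np.array([1000., 2000.])
>>> softmax(z_big)  # naive version overflows (NumPy emits warnings)
array([ nan,  nan])
>>> softmax_stable(z_big)
array([ 0.,  1.])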

Broadening the output spectrum by using a hyperbolic tangent

Another sigmoid function that is often used in the hidden layers of artificial neural networks is the hyperbolic tangent (tanh), which can be interpreted as a rescaled version of the logistic function.

$$\phi_{tanh}(z) = 2 \times \phi_{logistic}(2z) - 1 = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
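We can quickly verify this rescaling relationship numerically (a check of our own, using NumPy's built-in tanh and the logistic function we defined previously):

>>> z = np.linspace(-5, 5, 11)
>>> np.allclose(np.tanh(z), 2.0 * logistic(2.0 * z) - 1.0)
True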

The advantage of the hyperbolic tangent over the logistic function is that it has a broader output spectrum, spanning the open interval (-1, 1), which can improve the convergence of the backpropagation algorithm (C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995, pp. 500-501). In contrast, the logistic function returns an output signal that lies in the open interval (0, 1). For an intuitive comparison of the logistic function and the hyperbolic tangent, let's plot the two sigmoid functions:

>>> import matplotlib.pyplot as plt

>>> def tanh(z):
...     e_p = np.exp(z) 
...     e_m = np.exp(-z)
...     return (e_p - e_m) / (e_p + e_m)  

>>> z = np.arange(-5, 5, 0.005)
>>> log_act = logistic(z)
>>> tanh_act = tanh(z)

>>> plt.ylim([-1.5, 1.5])
>>> plt.xlabel('net input $z$')
>>> plt.ylabel('activation $\phi(z)$')
>>> plt.axhline(1, color='black', linestyle='--')
>>> plt.axhline(0.5, color='black', linestyle='--')
>>> plt.axhline(0, color='black', linestyle='--')
>>> plt.axhline(-1, color='black', linestyle='--')

>>> plt.plot(z, tanh_act, 
...          linewidth=2, 
...          color='black', 
...          label='tanh')
>>> plt.plot(z, log_act, 
...          linewidth=2, 
...          color='lightgreen', 
...          label='logistic')

>>> plt.legend(loc='lower right')
>>> plt.tight_layout()
>>> plt.show()

As we can see, the shapes of the two sigmoidal curves look very similar; however, the output range of the tanh function is twice as large as that of the logistic function, covering (-1, 1) instead of (0, 1):

[Figure: the logistic and tanh activation functions plotted over the net input range z = -5 to 5]

Note that we implemented the logistic and tanh functions verbosely for the purpose of illustration. In practice, we can use NumPy's tanh function to achieve the same results:

>>> tanh_act = np.tanh(z)

In addition, the logistic function is available in SciPy's special module:

>>> from scipy.special import expit
>>> log_act = expit(z)
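As a quick sanity check (our own addition), both library functions agree with the verbose implementations we defined earlier:

>>> np.allclose(expit(z), logistic(z))
True
>>> np.allclose(np.tanh(z), tanh(z))
True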

Now that we know more about the different activation functions that are commonly used in artificial neural networks, let's conclude this section with an overview of the different activation functions that we encountered in this book.

[Figure: overview of the activation functions discussed in this book]