How it works...

The following are explanations of the activation functions:

  • Threshold activation function was used by the McCulloch-Pitts neuron and early perceptrons. It is discontinuous at x = 0 and its gradient is zero everywhere else, so it cannot be used to train a network with gradient descent or its variants. (A NumPy sketch of these functions appears after this list.)
  • Sigmoid activation function was very popular at one time. Its curve looks like a smooth, continuous version of the threshold activation function. It suffers from the vanishing gradient problem: the gradient approaches zero as the input moves toward either extreme, where the function saturates. This makes training and optimization difficult.
  • Hyperbolic tangent activation function is also sigmoidal in shape and nonlinear. The function is zero-centered and has steeper gradients than the sigmoid. Like the sigmoid, it also suffers from the vanishing gradient problem.
  • Linear activation function is, as the name suggests, linear in nature. The function is unbounded on both sides, (-inf, inf). Its linearity is its major drawback: a sum of linear functions is linear, and a linear function of a linear function is still linear, so a network built only from linear activations collapses into a single linear transformation and cannot capture the non-linearities present in complex datasets.
  • ReLU activation function is the rectified version of the linear activation function, and this rectification allows it to capture non-linearities when used across multiple layers. A major advantage of ReLU is that it leads to sparse activation: at any given time, only the neurons with positive pre-activation fire, while the rest output zero, which makes the network computationally lighter. ReLU neurons can suffer from the dying ReLU problem: a neuron that stops firing receives a zero gradient, can no longer be trained, and stays off (dead). Despite this problem, ReLU is today one of the most widely used activation functions for hidden layers.
  • Softmax activation function is popularly used as the activation function of the output layer. The function is bounded in the range [0, 1] and is used to represent class probabilities in a multiclass classification problem; the outputs of all the units always sum to 1 (see the softmax sketch after this list).
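As a quick reference, the following is a minimal NumPy sketch of the threshold, sigmoid, tanh, linear, and ReLU functions described above. It is illustrative only (not taken from this recipe's code), and firing at x >= 0 is an assumed convention for the threshold:

    import numpy as np

    def threshold(x):
        # Fires 1 when the input is non-negative, 0 otherwise; zero gradient almost everywhere.
        return np.where(x >= 0, 1.0, 0.0)

    def sigmoid(x):
        # Squashes the input into (0, 1); saturates (gradient -> 0) for large |x|.
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # Zero-centered, bounded in (-1, 1); steeper than the sigmoid but also saturates.
        return np.tanh(x)

    def linear(x):
        # Identity mapping; stacking these collapses into a single linear transformation.
        return x

    def relu(x):
        # Zero for negative inputs, identity for positive ones, giving sparse activations.
        return np.maximum(0.0, x)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    for name, fn in [("threshold", threshold), ("sigmoid", sigmoid),
                     ("tanh", tanh), ("linear", linear), ("relu", relu)]:
        print(f"{name:>9}: {fn(x)}")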
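For the output layer, one common way to write softmax in NumPy is sketched below; subtracting the maximum logit before exponentiating is a standard numerical-stability trick, and the example logits are made up for illustration. Running it confirms that every output lies in [0, 1] and that the outputs sum to 1:

    import numpy as np

    def softmax(z):
        # Subtracting the maximum keeps exp() from overflowing without changing the result.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.1])
    probs = softmax(logits)
    print(probs)        # approximately [0.659 0.242 0.099], each value in [0, 1]
    print(probs.sum())  # 1.0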