Activation functions

Picking good activation functions makes training much easier. The choice also carries at least two practical implications: how you should transform your data and how binary variables (if there are any) should be encoded. There is an infinity of possible activation functions; in principle, almost any continuous function could serve, and you can even make one of your own.

The only requirement for a function to be eligible as an activation function is that it be differentiable, or at least that you can assume a reasonable proxy at the points where the derivative is not defined (ReLU, for instance, is not differentiable at zero, but a value is simply assumed there).

Here is a list of popular activation functions (a small code sketch follows the list):

  • Rectified Linear Unit (ReLU): f(x) = max(0, x)
  • Leaky Rectified Linear Unit (Leaky ReLU): f(x) = max(αx, x), where α is a small positive slope (for example, 0.01)
  • Hyperbolic Tangent (Tanh): f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
  • Sigmoid: f(x) = 1 / (1 + e^(−x))
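
As a minimal sketch (not from the original text), here is how these four functions can be written with NumPy; the Leaky ReLU slope alpha is an assumed default of 0.01:

import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs keep a small slope alpha instead of being zeroed
    return np.where(x > 0, x, alpha * x)

def tanh(x):
    # Hyperbolic tangent, bounded between -1 and 1
    return np.tanh(x)

def sigmoid(x):
    # Logistic function, bounded between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

# Evaluate each function on inputs between -3 and 3, the same range as Figure 8.3
x = np.linspace(-3.0, 3.0, 7)
for fn in (relu, leaky_relu, tanh, sigmoid):
    print(fn.__name__, np.round(fn(x), 3))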

People who are used to time-series modeling usually lean toward things such as modeling the series in differences instead of in levels. That's because they frequently rely upon linear methods that demand stationarity, and if a series is not stationary in levels, it might be in the first difference.
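
For reference, a minimal sketch of first differencing with NumPy (the series values are made up for illustration):

import numpy as np

# A hypothetical trending (non-stationary) series measured in levels
level = np.array([100.0, 102.0, 105.0, 104.0, 108.0, 111.0])

# First difference: period-over-period change, often closer to stationary
diff = np.diff(level)
print(diff)  # [ 2.  3. -1.  4.  3.]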

Transformations such as differencing could help in specific cases, but they are not what is generally looked for before the ANN training starts (the preprocessing stage). To understand why, it helps to look at the shape of the activation functions; the following shows the ReLU, Tanh, and Sigmoid responses to inputs in the -3 to 3 range:

Figure 8.3: The behavior of popular activation functions

The ReLU function is bounded between zero and infinity. Tanh goes from -1.0 to 1.0, while Sigmoid goes from zero to one. In a perfect world, the range of your inputs wouldn't matter, and the weights would do a wonderful job of rescaling your variables if required. In the real world, if your input range doesn't match the activation function that well, you might get stuck in a local optimum during training.

If you're using Tanh as an activation function, try to encode binary variables as -1.0 and 1.0 instead of zero and one.

Generally, standardization will help a lot, but there are most certainly alternatives. The activation function that I used the most when I got started was Tanh, and because of this, the transformation that I am most used to is a max–min type of transformation that coerces the data into a range between -1 and 1, which makes training a lot smoother. It goes like this:

x' = 2 * (x - min(x)) / (max(x) - min(x)) - 1

There are lots of transformations, and lots of other kinds of max–min transformations as well. The most common one scales data into a range of zero to one:

x' = (x - min(x)) / (max(x) - min(x))

In both formulas, x' stands for the transformed variable. It's also good to transform the output if you are dealing with a regression problem. Just don't forget that you might have to redo and undo the transformation later. Store the key numbers (the max and min) for each variable someplace; this way, you can freely redo or undo the transformation whenever you need to.
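
As a sketch under my own naming (the helper functions below are not from the text), the [-1, 1] max–min transformation and its inverse could look like this, with the per-variable min and max stored so the scaling can be redone or undone later:

import numpy as np

def minmax_fit(x):
    # Store the key numbers (min and max) so the transformation can be redone or undone
    return {"min": float(np.min(x)), "max": float(np.max(x))}

def minmax_transform(x, params):
    # Coerce the data into the -1 to 1 range that suits Tanh-style networks
    return 2.0 * (x - params["min"]) / (params["max"] - params["min"]) - 1.0

def minmax_inverse(z, params):
    # Undo the transformation, e.g. to bring regression predictions back to the original scale
    return (z + 1.0) / 2.0 * (params["max"] - params["min"]) + params["min"]

x = np.array([10.0, 12.0, 15.0, 20.0])
params = minmax_fit(x)
z = minmax_transform(x, params)
print(z)                          # values between -1 and 1
print(minmax_inverse(z, params))  # recovers the original values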

Sigmoid and Tanh were the first ones (in that order) to become popular. Tanh solves some of the problems carried by Sigmoid, notably that its output is zero-centered. Yet ReLU seems to stand out as an activation function, meaning that it should be a safe default choice. ReLU is not perfect either; Leaky ReLU solves some of the problems with regular ReLU, such as units that get stuck outputting zero for every input.

Maxout is yet another alternative activation function, but it greatly increases the number of parameters, so there is a trade-off. Output nodes usually demand different types of activation functions: linear activations are usually the best choice for regression problems, while softmax tends to be the best performer for classification problems.
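
As an illustrative sketch (not from the original text), a numerically stable softmax output can be written as follows; for a regression output you would simply leave the node linear, with no activation:

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, then normalize into probabilities
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])  # raw output-node values for a three-class problem
print(softmax(logits))              # probabilities that sum to one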

A single node with a proper activation function does not make a network. To have an ANN, you must have many nodes arranged in layers.
