Today I am going to discuss the different activation functions in neural networks and when each should be used. Before that, let's have a look at the definition of activation functions.
Activation functions:
The activation function in a neural network takes the weighted sum of the inputs and the bias at a node and decides, based on different conditions, what output the node passes on, i.e. it introduces linearity or non-linearity. An activation function can effectively switch a neuron "ON" or "OFF," depending on the input.
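To make the definition concrete, here is a minimal Python sketch of a single node applying an activation function to its weighted input plus bias. The weights, bias values, and the simple "ON/OFF" step activation are my own illustrative choices, not taken from any particular framework.

```python
import numpy as np

def step_activation(z):
    # A simple "ON"/"OFF" activation: the neuron fires (1) only when
    # the weighted input plus bias is positive
    return 1.0 if z > 0 else 0.0

x = np.array([0.5, -1.2, 3.0])   # inputs to the node
w = np.array([0.4, 0.3, -0.2])   # weights (illustrative values)
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs plus bias
print(step_activation(z))        # 0.0 -> the neuron stays "OFF" for this input
```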
Non-linear activation functions:
A. ReLU - When to consider it?
The main purpose of an activation function is to introduce nonlinearity into the model. The rectified linear unit (ReLU) does this for a deep learning model while helping to avoid the vanishing gradient issue and keeping backpropagation computationally efficient. A neuron is deactivated only when the output of the linear transformation is less than 0. ReLU is typically used in the hidden layers only.
The equation for the ReLU function is simply f(x) = max(0, x).
The advantages of using ReLU: it is computationally cheap, it produces sparse activations (negative inputs map to zero), and it does not saturate for positive inputs.
The limitations: the "dying ReLU" problem, where neurons that only receive negative inputs output zero and stop learning, and the fact that the output is not zero-centered.
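To illustrate the points above, here is a minimal NumPy sketch of ReLU and its gradient. The function names and sample inputs are my own; most frameworks ship their own ReLU implementation.

```python
import numpy as np

def relu(x):
    # ReLU: f(x) = max(0, x) -- passes positive values through, zeroes out negatives
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 otherwise; this keeps backpropagation cheap
    # but is also the root of the "dying ReLU" problem
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # [0.  0.  0.  1.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```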
To avoid the problems related to ReLU, there are several updated versions of ReLU, described below (a small code sketch of all three follows the list):
1. Leaky ReLU Function: Solves the dying ReLU problem that the standard ReLU can cause, where parts of the neural network stop learning. It is based on ReLU, but it has a small slope for negative values instead of a flat slope. The slope coefficient is fixed before training.
2. Parametric ReLU Function: A Parametric Rectified Linear Unit, or PReLU, generalizes the traditional rectified unit with a learnable slope for negative values. Because negative inputs still produce nonzero outputs and gradients, it is able to tackle the problem that standard ReLU returns zero for any negative input.
3. Exponential Linear Unit (ELU) Function:
The exponential linear unit (ELU) function is used to accelerate the training of neural networks (just like the ReLU function). Its biggest advantage is that it avoids the vanishing gradient problem by using the identity for positive values, which improves the learning characteristics of the model.
For negative inputs, ELU produces values that push the mean unit activation closer to zero, thereby speeding up learning.
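Here is a minimal NumPy sketch of the three variants, assuming a fixed slope of 0.01 for Leaky ReLU, a learnable slope `alpha` for PReLU, and the commonly used alpha = 1.0 for ELU. These parameter values and function names are my own illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Leaky ReLU: small fixed slope for negative inputs, chosen before training
    return np.where(x > 0, x, slope * x)

def prelu(x, alpha):
    # PReLU: same shape as Leaky ReLU, but alpha is learned during training
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # ELU: identity for positive inputs, smooth exponential curve for negatives,
    # which pushes the mean activation toward zero
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))         # [-0.03 -0.01  0.    2.  ]
print(prelu(x, alpha=0.25))  # [-0.75 -0.25  0.    2.  ]
print(elu(x))                # [~-0.95 ~-0.63  0.    2.  ]
```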
Tanh works well with RNNs, although RNNs suffer from vanishing gradients.
B. SoftMax Function and when to use it: It is known as the normalized exponential function and is a generalization of logistic regression, mostly used in multiclass classification problems. It calculates relative probabilities; that is, it uses the raw output values of the final layer (A1, A2, A3, and so on) to determine the final probability values.
It is mostly used in image recognition and Natural Language Processing (NLP). For instance, in a neural network model predicting types of flowers, SoftMax would help determine the probability of an image being a Marigold, Rose, Jasmine, or Dahlia, and it ensures the sum of these probabilities equals one.
Disadvantages: it can suffer from numerical instability, because the exponentials can overflow for large inputs and the logarithm in the accompanying cross-entropy loss can blow up for probabilities near zero.
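Here is a minimal, numerically stabilized SoftMax sketch using the flower example above. The class order and the raw logit values are illustrative assumptions on my part.

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit before exponentiating to avoid overflow,
    # then normalize so the probabilities sum to one
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1, -1.0])  # Marigold, Rose, Jasmine, Dahlia
probs = softmax(logits)
print(probs)        # ~[0.64 0.23 0.10 0.03]
print(probs.sum())  # 1.0
```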
C. Swish Activation Function and when to use it: Not used as much in practice, but you can find the details in various references.
D. Gaussian Error Linear Unit (GELU): Not used as much in practice, but you can find the details in various references.
E. SELU (Scaled Exponential Linear Unit): Not used as much in practice, but you can find the details in various references.
F. Tanh Function (Hyperbolic Tangent):
The tanh function outputs values in the range of -1 to +1. It can deal with negative values more effectively than the sigmoid function, which has a range of 0 to 1. Unlike the sigmoid function, tanh is zero-centered, which means that its output is symmetric around the origin of the coordinate system.
Advantage: The tanh function is used more extensively than the sigmoid function since it delivers better training performance for multilayer neural networks. Its biggest advantage is that it produces a zero-centered output, which supports the backpropagation process.
Commonly Used: The tanh function has been mostly used for natural language processing and speech recognition tasks.
However, the tanh function, too, has a limitation: just like the sigmoid function, it cannot solve the vanishing gradient problem. Also, the tanh function can only attain a gradient of 1 when the input value is 0 (x is zero). As a result, the function can produce some effectively dead (saturated) neurons during the computation process.
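A minimal sketch of tanh and its gradient shows the zero-centered output and the fact that the gradient only reaches 1 at x = 0 and shrinks quickly for large inputs. The function name and sample inputs are my own.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; it peaks at 1 when x = 0 and
    # shrinks toward 0 for large |x|, which is where gradients vanish
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.tanh(x))    # [-0.995 -0.762  0.     0.762  0.995]
print(tanh_grad(x))  # [ 0.01   0.42   1.     0.42   0.01 ]
```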
G. Sigmoid / Logistic Activation Function:
The sigmoid function takes any real value as input and outputs values in the range of 0 to 1.
Sigmoid functions normally work well for classifiers; the sigmoid is mostly used as the activation for the output layer of binary classification models.
The main problem with this function is the vanishing gradient problem: the gradients used to update the network become extremely small, or "vanish," when they are backpropagated from the output layers to the earlier layers. This happens when the input to the sigmoid falls outside a narrow range around zero, where the function saturates.
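To see the vanishing gradient numerically, here is a minimal sketch of the sigmoid and its gradient. The sample inputs are my own illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); its maximum is only
    # 0.25 at x = 0 and it collapses toward 0 for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # [~0.00005  0.119  0.5    0.881  ~0.99995]
print(sigmoid_grad(x))  # [~0.00005  0.105  0.25   0.105  ~0.00005]
```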
I hope this helps you identify the proper activation function to apply in your scenario.