
Today's topic is the different optimizer functions in deep learning and when each one should be used. I will discuss the most commonly used optimizers, with real scenarios where they can be applied.
1. Adam : The Adaptive Moment Estimation (Adam) algorithm is built by combining gradient descent with momentum and the RMSProp algorithm. When the data volume is large, there are many parameters, and we want efficient computation with fast convergence, Adam is a good choice. It also addresses the problems of a vanishing learning rate and high variance in the updates. Its main drawback is that it is computationally costly. A minimal sketch of one update step is shown below.
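This NumPy sketch only illustrates the moving parts of one Adam step; the function name adam_update, the toy quadratic example, and the default hyperparameters are assumptions made for illustration, not a library API.

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (m) plus RMSProp-style scaling (v) with bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: gradient with momentum
    v = beta2 * v + (1 - beta2) * grad**2       # second moment: squared gradients (RMSProp part)
    m_hat = m / (1 - beta1**t)                  # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) # adaptive, per-parameter step
    return w, m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 1001):
    grad = 2 * w
    w, m, v = adam_update(w, grad, m, v, t)
```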
2. RMSProp (Root Mean Square Propagation) : This method uses an exponentially weighted average of the squared gradients to determine the learning rate at time t for each iteration. A decay factor gamma of 0.95 is commonly suggested in the formula, as it has shown good results in most cases. We can use it when the dataset contains non-stationary objectives or when training recurrent neural networks (RNNs). It has been shown to perform well on tasks where Adagrad's performance is compromised by its continually decreasing learning rate. A small sketch of the update follows.
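As a rough illustration, here is how the exponentially weighted average of squared gradients drives the RMSProp update; the function name and defaults (including the gamma = 0.95 mentioned above) are assumptions for this sketch.

```python
import numpy as np

def rmsprop_update(w, grad, avg_sq, lr=0.001, gamma=0.95, eps=1e-8):
    """One RMSProp step using an exponentially weighted average of squared gradients."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grad**2   # decaying average of g^2
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)       # scale the step per parameter
    return w, avg_sq
```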
3. Gradient Descent : This algorithm uses the first-order derivative of the loss function to determine in which direction the weights should be changed to reach a minimum. Through backpropagation, the loss is propagated from one layer to the next, and the model's parameters (also known as weights) are adjusted based on the loss so that it can be minimized.
It is the most widely used optimization algorithm, applied in linear regression and in classification models trained with backpropagation. A minimal sketch is shown below.
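To make the idea concrete, here is a minimal full-batch gradient descent sketch on a synthetic linear-regression problem; the data, learning rate, and epoch count are made up purely for illustration.

```python
import numpy as np

# Synthetic linear-regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for epoch in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # first-order derivative of the MSE loss over the full dataset
    w -= lr * grad                          # step in the direction that reduces the loss
```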
4. Stochastic Gradient Descent : In this optimizer, instead of using the entire dataset for each iteration of the backpropagation process, a single randomly selected training example is used to calculate the gradient (loss) and update the model parameters (weights and bias). This random selection introduces noise into the updates.
When the dataset is large and the weights and biases need to be adjusted through backpropagation, Stochastic Gradient Descent is useful. The update equation is w = w − η · ∇L(w; x_i, y_i), where (x_i, y_i) is the randomly chosen example and η is the learning rate. A small sketch follows.
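Here is a minimal sketch of the same kind of linear-regression problem trained with SGD, picking one random example per update; the data, names, and hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr = 0.01
for step in range(5000):
    i = rng.integers(len(y))             # pick a single random training example
    grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of the squared error for that one example
    w -= lr * grad                       # update weights using only that sample
```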
5. Adagrad : This algorithm adapts the learning rate of each individual parameter by scaling it based on the historical squared gradients of that parameter. It uses a different learning rate for each parameter at every iteration. The reason different learning rates are needed is that parameters for sparse features require a higher learning rate than those for dense features, because sparse features occur less frequently.
It is useful for sparse data, as it ensures that infrequent features, which are common in text data, receive larger updates, improving model performance. A small sketch follows.
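This minimal sketch shows the per-parameter scaling: the accumulated squared gradients shrink the effective learning rate for frequently updated (dense) parameters while rarely updated (sparse) parameters keep larger steps. The function name and defaults are assumptions for illustration.

```python
import numpy as np

def adagrad_update(w, grad, sq_sum, lr=0.01, eps=1e-8):
    """One Adagrad step; sq_sum holds the historical sum of squared gradients per parameter."""
    sq_sum = sq_sum + grad**2                    # accumulate squared gradients (never decays)
    w = w - lr * grad / (np.sqrt(sq_sum) + eps)  # per-parameter effective learning rate
    return w, sq_sum
```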
These are the most commonly used optimizers in deep learning. I have provided only a summarized overview to aid understanding; you can find more details in other resources.
Hope this helps.