Today's topic is the different optimizer functions in deep learning and when each needs to be used. I will discuss the most commonly used optimizers, with real scenarios where each can be applied.

1. Adam (Adaptive Moment Estimation): This algorithm is built by combining gradient descent with momentum and the RMSProp algorithm. When the data volume is large and there are many parameters, and we want efficient computation with fast convergence, we use Adam. It also rectifies the vanishing learning rate and high variance problems. Its only drawback is that it is computationally costly.

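The original formula image is not reproduced here, but the standard Adam update can be sketched as follows (a minimal NumPy sketch; the values lr=0.001, beta1=0.9, beta2=0.999 are common defaults, not taken from this post):

import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter vector w, given its gradient at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: momentum term
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: RMSProp-style term
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) # per-parameter adaptive update
    return w, m, v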

2. RMSProp (Root Mean Square Propagation): This method uses an exponentially weighted average of the squared gradients to determine the learning rate at time t for each iteration. It is suggested to set gamma to 0.95 in the formula, as this has shown good results in most cases. We can use it when the dataset contains non-stationary objectives or when training recurrent neural networks (RNNs). It has been shown to perform well on tasks where Adagrad's performance is compromised by its continually decreasing learning rate.

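A minimal NumPy sketch of the RMSProp update, using the gamma = 0.95 decay mentioned above (the learning rate and epsilon values are illustrative assumptions):

import numpy as np

def rmsprop_update(w, grad, avg_sq_grad, lr=0.001, gamma=0.95, eps=1e-8):
    """One RMSProp step; avg_sq_grad is the exponentially weighted average of squared gradients."""
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg_sq_grad) + eps)   # larger average -> smaller effective step
    return w, avg_sq_grad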

3. Gradient Descent: This algorithm uses the first-order derivative of the loss function to determine in which direction the weights should be changed to reach a minimum. Through backpropagation, the loss is propagated from one layer to the next, and the model's parameters, also known as weights, are adjusted according to the loss so that it can be minimized.

It is the most widely used optimization algorithm in linear regression and classification models trained with backpropagation.

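As an illustration, here is a minimal NumPy sketch of full-batch gradient descent for linear regression with a mean squared error loss (the learning rate and epoch count are illustrative assumptions):

import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=100):
    """Full-batch gradient descent for linear regression with MSE loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        y_pred = X @ w + b
        error = y_pred - y
        grad_w = (2 / n) * X.T @ error   # first-order derivative of the loss w.r.t. weights
        grad_b = (2 / n) * error.sum()   # derivative w.r.t. bias
        w -= lr * grad_w                 # step against the gradient, towards the minimum
        b -= lr * grad_b
    return w, b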

 

4. Stochastic Gradient Descent (SGD): In this optimization, instead of using the entire dataset for each iteration of the backpropagation process, only a single randomly selected training example is used to calculate the gradient (loss) and update the model parameters (weights and biases). This random selection introduces randomness into the updates.

When the dataset is large and the weights and biases need to be adjusted through backpropagation, Stochastic Gradient Descent is useful. A sketch of the update rule is shown below.

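A minimal NumPy sketch of stochastic gradient descent on the same linear regression setup, updating the weights and bias from one randomly chosen training example at a time (hyperparameters are illustrative assumptions):

import numpy as np

def sgd(X, y, lr=0.01, epochs=10, seed=0):
    """Stochastic gradient descent: one randomly chosen sample per parameter update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):        # visit single examples in random order
            error = (X[i] @ w + b) - y[i]
            w -= lr * 2 * error * X[i]           # gradient estimated from this one sample
            b -= lr * 2 * error
    return w, b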

5. Adagrad (Adaptive Gradient): This algorithm adapts the learning rate of each individual parameter by scaling it based on that parameter's historical squared gradients, so a different learning rate is used for each parameter at every iteration. The reason different learning rates are needed is that parameters for sparse features need a higher learning rate than parameters for dense features, because sparse features occur less frequently.

It is useful for sparse data as it ensures that infrequent features, which are common in text data, receive larger updates, improving model performance.

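A minimal NumPy sketch of the Adagrad update, where the accumulated squared gradients give each parameter its own effective learning rate (the learning rate and epsilon are illustrative assumptions):

import numpy as np

def adagrad_update(w, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    """One Adagrad step; grad_sq_sum accumulates all historical squared gradients."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    w = w - lr * grad / (np.sqrt(grad_sq_sum) + eps)   # rarely updated (sparse) parameters keep larger steps
    return w, grad_sq_sum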

The above are the most commonly used optimizers in deep learning. I have provided only a summarized overview to aid understanding; you can find the details on various sites.

Hope this will help. 
