Last Updated : 11 Jul, 2025
Momentum-based gradient optimizers are used to optimize the training of machine learning models. They extend the classic gradient descent method and help accelerate the training process, especially for large-scale datasets and deep neural networks.
By incorporating a "momentum" term, these optimizers can navigate the loss surface more efficiently, leading to faster convergence, reduced oscillations and better overall performance.
Understanding Gradient Descent
Before looking at momentum-based optimizers, it’s important to understand the traditional gradient descent method. In gradient descent, the model's weights are updated by taking small steps in the direction of the negative gradient of the loss function. Mathematically, the weights are updated using:
w_{t+1} = w_t - \eta \nabla L(w_t)
Where:
- w_t are the model weights at step t
- \eta is the learning rate
- \nabla L(w_t) is the gradient of the loss function with respect to the weights
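To make this concrete, here is a minimal NumPy sketch of the update rule above, applied to a toy quadratic loss. The function and variable names (gradient_descent, grad_fn and so on) are illustrative choices, not part of any particular library.

import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, n_steps=100):
    # Plain gradient descent: w_{t+1} = w_t - lr * grad L(w_t)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - lr * grad_fn(w)
    return w

# Toy example: minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
grad = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(grad, w0=0.0))  # converges toward 3.0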
This method works well, but it suffers from issues such as slow convergence and a tendency to get stuck in local minima, especially in high-dimensional spaces. This is where momentum-based optimization becomes useful.
What is Momentum?
Momentum is a concept from physics where an object’s motion depends not only on the current force but also on its previous velocity. In the context of gradient optimization, it refers to a method that smooths the optimization trajectory by adding a term that helps the optimizer remember past gradients.
In mathematical terms, the momentum-based gradient descent update can be described as:
v_{t+1} = \beta v_t + (1 - \beta) \nabla L(w_t)
w_{t+1} = w_t - \eta v_{t+1}
Where:
- v_t is the velocity, an exponentially weighted average of past gradients (with v_0 = 0)
- \beta is the momentum coefficient, typically around 0.9
- \eta is the learning rate
- \nabla L(w_t) is the gradient of the loss at the current weights
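Here is a minimal sketch of this momentum update, reusing the same toy quadratic loss as above; names such as momentum_gd are illustrative.

import numpy as np

def momentum_gd(grad_fn, w0, lr=0.1, beta=0.9, n_steps=200):
    # v_{t+1} = beta * v_t + (1 - beta) * grad L(w_t)
    # w_{t+1} = w_t - lr * v_{t+1}
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad_fn(w)
        v = beta * v + (1.0 - beta) * g   # exponentially weighted average of past gradients
        w = w - lr * v                    # step along the smoothed direction
    return w

# Same toy loss L(w) = (w - 3)^2
grad = lambda w: 2.0 * (w - 3.0)
print(momentum_gd(grad, w0=np.array([0.0])))  # approaches 3.0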
There are several variations of momentum-based optimizers, each with slight modifications to the basic momentum algorithm:
1. Nesterov Accelerated Gradient (NAG)
Nesterov momentum is an advanced form of momentum-based optimization. It modifies the update rule by calculating the gradient at the upcoming position rather than the current position of the weights.
The update rule becomes:
v_{t+1} = \beta v_t + \nabla L(w_t - \eta \beta v_t)
w_{t+1} = w_t - \eta v_{t+1}
NAG is often more efficient than classical momentum because evaluating the gradient at the look-ahead point gives it a better view of the upcoming trajectory, leading to faster convergence and better performance in some cases.
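As a sketch, the look-ahead update above can be written as follows, again on the toy quadratic loss; names such as nesterov_gd are illustrative.

import numpy as np

def nesterov_gd(grad_fn, w0, lr=0.01, beta=0.9, n_steps=300):
    # v_{t+1} = beta * v_t + grad L(w_t - lr * beta * v_t)
    # w_{t+1} = w_t - lr * v_{t+1}
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        lookahead = w - lr * beta * v       # peek at where the momentum is about to take us
        v = beta * v + grad_fn(lookahead)   # correct the velocity with the gradient at that point
        w = w - lr * v
    return w

# Same toy loss L(w) = (w - 3)^2
grad = lambda w: 2.0 * (w - 3.0)
print(nesterov_gd(grad, w0=np.array([0.0])))  # approaches 3.0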
2. AdaMomentum
AdaMomentum combines the concept of adaptive learning rates with momentum. It adjusts the momentum term based on the recent gradients, making the optimizer more sensitive to the landscape of the loss function. This can help in fine-tuning the convergence process.
3. RMSProp (Root Mean Square Propagation)
Although not strictly a momentum-based optimizer in the traditional sense, RMSProp incorporates a form of momentum by keeping an exponentially decaying average of squared gradients and adapting the learning rate for each parameter. It’s particularly effective when dealing with non-stationary objectives, such as in training recurrent neural networks (RNNs).
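Below is a minimal sketch of the standard RMSProp update on the same toy quadratic loss; it is not a specific library implementation, and names such as rmsprop and rho are illustrative.

import numpy as np

def rmsprop(grad_fn, w0, lr=0.01, rho=0.9, eps=1e-8, n_steps=500):
    # Keep a running average of squared gradients and scale each
    # parameter's step by the inverse of its root mean square.
    w = np.asarray(w0, dtype=float)
    s = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad_fn(w)
        s = rho * s + (1.0 - rho) * g ** 2     # running average of squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)    # per-parameter adaptive step size
    return w

# Same toy loss L(w) = (w - 3)^2
grad = lambda w: 2.0 * (w - 3.0)
print(rmsprop(grad, w0=np.array([0.0])))  # settles near 3.0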
Advantages of Momentum-Based Optimizers
- Faster convergence than plain gradient descent, since the accumulated velocity keeps the optimizer moving through flat regions of the loss surface.
- Reduced oscillations in directions where the gradient frequently changes sign, giving a smoother optimization trajectory.
- Less tendency to stall in shallow local minima, which is especially helpful when training deep neural networks on large-scale datasets.