Gradient descent is the foundational optimization algorithm behind nearly all of modern machine learning. Whether you are training a simple linear regression or a billion-parameter large language model like GPT, gradient descent is what adjusts the model's weights so its predictions get closer to the truth.
The core idea is surprisingly intuitive. Imagine you are standing on a hilly landscape in thick fog and you want to reach the lowest valley. You can't see far, but you can feel the slope beneath your feet. Gradient descent says: figure out which direction goes downhill the steepest, then take a step in that direction. Repeat until you stop descending.
Mathematically, the gradient of a function is a vector that points in the direction of steepest ascent. To minimize a loss function, you move in the opposite direction of the gradient. The size of each step is controlled by a scalar called the learning rate. The update rule is simple: w = w - lr * gradient, where w represents a model parameter, lr is the learning rate, and the gradient tells you how the loss changes with respect to that parameter.
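This update rule can be sketched in a few lines. A toy one-dimensional example (not the visualizer's actual code), minimizing f(w) = (w - 3)^2, whose minimum sits at w = 3:

```python
# Gradient descent on f(w) = (w - 3)^2.
# The gradient is df/dw = 2 * (w - 3); each step moves w against it.

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0      # initial parameter guess
lr = 0.1     # learning rate
for _ in range(100):
    w = w - lr * grad(w)   # the update rule from the text

print(round(w, 4))  # converges close to 3.0
```

Each iteration shrinks the distance to the minimum by a constant factor (1 - 2·lr here), which is why the trajectory slows down as it approaches the bottom.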
Over the decades, researchers have proposed many variations of gradient descent. The visualizer above lets you compare four of the most important ones side by side, each with its own strategy for navigating the loss surface.
Stochastic gradient descent is the simplest optimizer. At each step it computes the gradient of the loss with respect to the parameters and moves in the opposite direction, scaled by the learning rate. "Stochastic" means it typically uses a random mini-batch of data rather than the full dataset, which introduces noise but makes each step much cheaper to compute.
The downside of vanilla SGD is that it treats every direction equally. On elongated loss surfaces, it oscillates back and forth across the narrow dimension while making slow progress along the long dimension. It can also get stuck at saddle points where the gradient is near zero.
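The "stochastic" part can be sketched concretely. This toy example (illustrative names, not a library API) fits y = 2x with squared loss, estimating each gradient from a random mini-batch instead of all twenty points:

```python
import random

random.seed(0)

# Dataset for y = 2x; the true weight is 2.0.
data = [(x, 2.0 * x) for x in range(1, 21)]

# Gradient of the squared loss (w*x - y)^2 w.r.t. w, averaged over a batch.
def batch_grad(w, batch):
    return sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)

w, lr, batch_size = 0.0, 0.001, 4
for _ in range(500):
    batch = random.sample(data, batch_size)  # random mini-batch: noisy, cheap
    w -= lr * batch_grad(w, batch)

print(round(w, 3))  # lands near 2.0 despite the noisy gradient estimates
```

Each step sees only 4 of 20 examples, so individual gradients are noisy, but the noise averages out over many steps while each step costs a fifth as much.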
Momentum addresses SGD's oscillation problem by adding a velocity term. Instead of using only the current gradient, the optimizer maintains a running average of past gradients, much like a ball rolling downhill accumulates speed. When gradients point in a consistent direction, momentum accelerates the optimizer through flat regions and shallow valleys. When gradients oscillate, opposing directions cancel out and dampen the zigzagging.
The momentum coefficient (typically 0.9) controls how much history to retain. Higher values mean more smoothing and faster acceleration, but can also cause the optimizer to overshoot minima.
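In code, momentum adds one line to the vanilla update. A sketch on the same toy quadratic (heavy-ball form; frameworks differ slightly in how they scale the velocity):

```python
# Momentum: the velocity v accumulates past gradients, so steps
# accelerate along a consistent downhill direction and opposing
# gradients partially cancel.

def grad(w):
    return 2.0 * (w - 3.0)   # gradient of (w - 3)^2

w, v = 0.0, 0.0
lr, beta = 0.1, 0.9          # beta is the momentum coefficient
for _ in range(300):
    v = beta * v + grad(w)   # running accumulation of gradients
    w = w - lr * v           # step along the velocity, not the raw gradient

print(round(w, 4))
```

With beta = 0.9, roughly the last ten gradients contribute meaningfully to each step, which is the "history" the coefficient controls.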
RMSProp (Root Mean Square Propagation) takes a different approach: instead of adding momentum, it adapts the learning rate independently for each parameter. It maintains a running average of squared gradients for each weight. Parameters that have been receiving large gradients get their effective learning rate reduced, while parameters with small gradients get a boost.
This per-parameter adaptation is extremely useful when different dimensions of the loss surface have very different curvatures. RMSProp handles ill-conditioned surfaces gracefully, which is why it became popular for training recurrent neural networks.
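A sketch of the per-parameter adaptation on an ill-conditioned 2-D bowl, f(w) = 50·w0² + w1², where the curvature along one axis is 50 times the other (toy code, not any framework's implementation):

```python
# RMSProp: divide each gradient by the root-mean-square of its own
# history, so steep directions get small effective steps and shallow
# directions get large ones.

def grad(w):
    return [100.0 * w[0], 2.0 * w[1]]   # gradient of 50*w0^2 + w1^2

w = [1.0, 1.0]
s = [0.0, 0.0]                     # running average of squared gradients
lr, decay, eps = 0.01, 0.9, 1e-8
for _ in range(1000):
    g = grad(w)
    for i in range(2):
        s[i] = decay * s[i] + (1 - decay) * g[i] ** 2
        w[i] -= lr * g[i] / (s[i] ** 0.5 + eps)   # per-parameter step size

print([round(x, 3) for x in w])  # both coordinates end up near 0
```

Vanilla SGD on this surface must use a learning rate small enough for the steep axis and therefore crawls along the shallow one; RMSProp's normalization makes progress along both at a similar pace.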
Adam combines the best of both worlds: it maintains both a first moment estimate (the mean of gradients, like momentum) and a second moment estimate (the mean of squared gradients, like RMSProp). Additionally, Adam includes bias correction that compensates for the fact that the moment estimates are initialized at zero and are therefore biased toward zero in early training steps.
Adam is the default optimizer in most deep learning projects today because it works well across a wide range of problems with minimal hyperparameter tuning. Its typical default settings (learning rate 0.001, beta1 0.9, beta2 0.999) are a solid starting point for many architectures.
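Both moment estimates and the bias correction fit in a few lines. A sketch on the toy quadratic, using the default settings mentioned above except for a larger learning rate so convergence is visible quickly:

```python
# Adam: first moment m (momentum-like), second moment v (RMSProp-like),
# plus bias correction for their zero initialization.

def grad(w):
    return 2.0 * (w - 3.0)   # gradient of (w - 3)^2

w, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 2001):             # t starts at 1 for the correction terms
    g = grad(w)
    m = b1 * m + (1 - b1) * g        # running mean of gradients
    v = b2 * v + (1 - b2) * g * g    # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)        # bias correction: undoes the
    v_hat = v / (1 - b2 ** t)        # zero-initialization shrinkage
    w -= lr * m_hat / (v_hat ** 0.5 + eps)

print(round(w, 3))
```

Without the correction, m and v are close to zero for the first few steps (they start at zero and decay slowly), so early updates would be artificially tiny; dividing by (1 - beta^t) rescales them to unbiased estimates.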
The learning rate is arguably the single most important hyperparameter in gradient-based optimization. Set it too high and the optimizer will overshoot the minimum, bouncing around wildly or even diverging. Set it too low and training will crawl, potentially taking thousands of unnecessary steps to converge, or getting trapped in poor local minima.
The "Goldilocks zone" for the learning rate depends on the shape of the loss landscape, the optimizer being used, and even the stage of training. This is why practitioners often use learning rate schedules that change the rate over time: warm-up (start small, ramp up), step decay (reduce at milestones), cosine annealing, or one-cycle policies. The visualizer above lets you experiment with learning rate and immediately see the effect on convergence.
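Two of these schedules can be written as pure functions of the step number. The shapes and parameter names below are illustrative; deep learning frameworks ship their own versions:

```python
import math

# Step decay: cut the learning rate by a fixed factor at regular milestones.
def step_decay(step, base_lr=0.1, drop=0.5, every=30):
    return base_lr * (drop ** (step // every))

# Cosine annealing: smoothly decay from base_lr to min_lr over training.
def cosine_annealing(step, total_steps, base_lr=0.1, min_lr=0.0):
    t = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

print(step_decay(0), step_decay(30), step_decay(60))   # 0.1 0.05 0.025
print(round(cosine_annealing(0, 100), 3))              # 0.1 at the start
print(round(cosine_annealing(100, 100), 3))            # 0.0 at the end
```

A warm-up phase is often prepended to either schedule by linearly ramping from a small value to base_lr over the first few hundred steps.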
In real neural networks, the loss surface is a high-dimensional landscape with potentially billions of dimensions, one for each parameter. These surfaces are non-convex, meaning they can contain multiple local minima, saddle points, flat plateaus, and sharp ravines. The 3D surfaces in this tool are simplified to two dimensions, but they capture the essential challenges that optimizers face.
Interestingly, research has shown that in very high-dimensional spaces, local minima tend to be close in quality to the global minimum. The bigger practical problem is often saddle points, where the gradient is near zero but the point is not actually a minimum. Adaptive optimizers like Adam and RMSProp handle saddle points better than vanilla SGD because their per-parameter learning rates amplify steps along directions where gradients are persistently small, helping them move off these flat regions.
Use the interactive visualizer above to watch these optimizers navigate different loss surfaces in real time. Change the surface, adjust learning rates, and see first-hand why optimizer choice and hyperparameter tuning matter so much in machine learning.