How Optimizers Work
A visual guide to the algorithms that train neural networks — from vanilla gradient descent to Adam.
Gradient Descent: The Baseline
Every optimizer starts from the same idea: compute the gradient of your loss function, then take a small step in the opposite direction.
The update rule is w ← w − α ∇L(w), where α is the learning rate — how big a step to take. Too large and you overshoot; too small and training never finishes.
In practice you use mini-batch SGD: compute the gradient on a random subset of data instead of the full dataset. This is noisy but fast, and the noise often helps escape shallow local minima.
Try a high learning rate (≥ 0.6) — notice how it overshoots and oscillates.
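The update fits in a few lines of Python. This is a toy sketch on a 1-D quadratic — the loss, starting point, and learning rates are illustrative, not the demo's values:

```python
# Gradient descent on the toy loss L(w) = w^2, whose gradient is 2w.

def sgd(lr, w=1.0, steps=20):
    for _ in range(steps):
        grad = 2.0 * w      # dL/dw for L(w) = w^2
        w -= lr * grad      # step against the gradient
    return w

w_small = sgd(lr=0.1)   # shrinks toward the minimum at w = 0
w_large = sgd(lr=1.1)   # |1 - 2*lr| > 1: each step overshoots and grows
```

With lr = 0.1 each step multiplies w by 0.8 and the iterate settles toward zero; with lr = 1.1 the multiplier is −1.2, so the iterate flips sign and grows — the overshoot-and-oscillate behaviour from the demo.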
Momentum: Learning from Direction
Imagine rolling a ball down a hilly landscape. It doesn't stop and recalculate at every point — it carries velocity from its previous motion. Momentum does the same.
The update becomes v ← βv + (1 − β)∇L(w), then w ← w − αv, where v is the velocity — a running average of past gradients. When gradients consistently point the same direction, velocity builds and the optimizer accelerates. When gradients flip (noise), they cancel out.
This is especially powerful in ravines: plain SGD oscillates back and forth across the narrow walls, barely moving forward. Momentum dampens the oscillations and accelerates along the ravine floor. β = 0.9 is the standard default.
Both optimizers use the same learning rate. SGD zigzags across the ravine walls; Momentum dampens oscillations and reaches the minimum faster. Try β = 0 to reduce Momentum to plain SGD.
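Here's a minimal sketch of the ravine behaviour — a hypothetical quadratic that's much steeper in w2 than in w1 (the coefficients and learning rate are made up for illustration, not taken from the demo):

```python
# Momentum on a hypothetical ravine: L(w1, w2) = 0.5*w1^2 + 10*w2^2.
# With lr = 0.1, plain SGD's w2 update is exactly a sign flip, so it
# bounces wall to wall forever; momentum's velocity averages the
# alternating gradients away.

def momentum_run(beta, lr=0.1, steps=150):
    w1, w2 = 5.0, 1.0
    v1 = v2 = 0.0
    for _ in range(steps):
        g1, g2 = w1, 20.0 * w2               # gradients of the two terms
        v1 = beta * v1 + (1 - beta) * g1     # velocity builds along w1
        v2 = beta * v2 + (1 - beta) * g2     # sign flips cancel in w2
        w1 -= lr * v1
        w2 -= lr * v2
    return w1, w2

_, w2_momentum = momentum_run(beta=0.9)  # cross-ravine oscillation damped
_, w2_plain = momentum_run(beta=0.0)     # still bouncing at |w2| = 1
```

Setting beta=0 makes v equal the raw gradient, recovering plain SGD — the same reduction the demo's β slider shows.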
Adaptive Rates: One Size Does Not Fit All
Momentum smooths direction. But every parameter still shares the same learning rate. Consider a word embedding with 50,000 vocabulary entries: on any mini-batch, the common word "the" gets a gradient update — but the rare word "kerfuffle" might not fire for thousands of steps. When it finally does, you want a meaningful update, not a timid one calibrated for "the".
RMSProp
Track a moving average of each parameter's squared gradient, then scale the learning rate by its inverse square root.
Parameters in steep noisy regions (large v) get smaller steps. Parameters in flat consistent regions (small v) get larger steps. It's a cheap approximation of per-direction curvature without ever computing the expensive Hessian matrix.
w₁ is flat (tiny gradients); w₂ is steep (large gradients). SGD is stuck: any lr safe for w₂ leaves w₁ crawling — the step size readout shows SGD using the same value for both.
RMSProp automatically scales each parameter's step by its gradient history. Watch the readout: it quickly gives w₁ a much larger step than w₂.
Try raising SGD lr above 0.25 — w₂ starts to diverge.
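The same idea in code — two parameters with wildly different gradient scales. The loss and constants here are illustrative and don't match the demo's numbers:

```python
# RMSProp vs plain SGD on a hypothetical loss L = 0.005*w1^2 + 5*w2^2,
# so the gradients are 0.01*w1 (flat direction) and 10*w2 (steep).

def rmsprop_run(lr=0.01, beta=0.9, eps=1e-8, steps=300):
    w1, w2 = 1.0, 1.0
    v1 = v2 = 0.0
    for _ in range(steps):
        g1, g2 = 0.01 * w1, 10.0 * w2
        v1 = beta * v1 + (1 - beta) * g1 * g1   # EMA of squared gradient
        v2 = beta * v2 + (1 - beta) * g2 * g2
        w1 -= lr * g1 / (v1 ** 0.5 + eps)       # flat param: boosted step
        w2 -= lr * g2 / (v2 ** 0.5 + eps)       # steep param: tamed step
    return w1, w2

def sgd_run(lr=0.01, steps=300):
    w1, w2 = 1.0, 1.0
    for _ in range(steps):
        w1 -= lr * 0.01 * w1     # shared lr: w1 barely moves
        w2 -= lr * 10.0 * w2
    return w1, w2
```

Both of RMSProp's parameters reach the minimum, because g/√v ≈ ±1 makes every effective step roughly lr. SGD's w1 is still sitting near its starting point after 300 steps.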
Adam
Adam combines momentum with RMSProp. It maintains two running averages: m, an EMA of the gradients (the momentum term), and v, an EMA of the squared gradients (the RMSProp term).
m and v are bias-corrected: both are initialised at zero, so early steps would be too small without the corrections m̂ = m/(1 − β₁ᵗ) and v̂ = v/(1 − β₂ᵗ). By step 50 the m̂ correction is essentially 1; the v̂ correction, with β₂ = 0.999, takes a few thousand steps to fade.
The key insight: when gradients are consistent, m̂ ≈ g and √v̂ ≈ |g|, so m̂/√v̂ ≈ ±1 regardless of gradient magnitude. Adam's effective step is approximately ±α — it self-normalises. This is why a single learning rate works across parameters with wildly different gradient scales, and why Adam is far less sensitive to hyperparameter tuning than SGD.
Gradient magnitudes shrink during training (the loss is converging). SGD's step (orange) follows the raw gradient — large and noisy early, tiny late.
Adam (blue) normalises: m̂/√v̂ ≈ ±1 regardless of gradient scale, so the effective step stays near α (dashed line). Crank up the noise to see Adam's momentum term smooth out the jitter.
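The full update, bias correction included, fits in a dozen lines. A sketch on a toy quadratic with the standard defaults (β₁ = 0.9, β₂ = 0.999) — the loss and learning rate are illustrative, not the demo's setup:

```python
# Adam on the toy loss L(w) = w^2.

def adam_run(lr=0.05, b1=0.9, b2=0.999, eps=1e-8, steps=500):
    w, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * w
        m = b1 * m + (1 - b1) * g        # EMA of gradients (momentum)
        v = b2 * v + (1 - b2) * g * g    # EMA of squared gradients
        m_hat = m / (1 - b1 ** t)        # bias correction: m and v start
        v_hat = v / (1 - b2 ** t)        # at zero, so early EMAs run small
        w -= lr * m_hat / (v_hat ** 0.5 + eps)   # effective step ≈ ±lr
    return w
```

Note that t starts at 1: at t = 0 the correction denominators 1 − βᵗ would be zero.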
AdamW
Adam's per-parameter scaling distorts L2 regularisation applied through the gradient. AdamW decouples weight decay from the gradient update: w ← w − α m̂/(√v̂ + ε) − αλw, where λ is the weight-decay coefficient.
The decay term is now independent of the adaptive scaling. AdamW is the default for transformers and large language models.
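As a sketch, the difference from Adam's update is a single extra line. The lr and wd values below are illustrative defaults, not prescriptions:

```python
# One AdamW step: Adam's adaptive update plus a decoupled decay term.

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (v_hat ** 0.5 + eps)   # adaptive gradient step
    w -= lr * wd * w    # decay applied directly, never rescaled by v
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):          # ten steps with a constant gradient of 1
    w, m, v = adamw_step(w, 1.0, m, v, t)
```

With L2-through-the-gradient, the decay would get divided by √v̂ like everything else, so frequently-updated parameters would be regularised less; the decoupled term shrinks every weight at the same rate.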
Putting It All Together
You've seen each optimizer fix a specific weakness of the one before it. Now let's watch them all race on the same surface — same starting point, same learning rate — and see how those fixes compound.
All four optimizers start at the same point with the same learning rate. Watch how each one handles the ill-conditioned surface differently.
Try lr = 0.10 — SGD barely moves along w₁ while Adam converges smoothly. Then try lr = 0.45 — SGD diverges in w₂ while Adam stays stable.
The Optimizer Family Tree
SGD
├── + Momentum (gradient direction smoothing)
│   └── + Nesterov (look-ahead gradient)
└── Adagrad (per-parameter adaptive LR)
    └── RMSProp (EMA instead of running sum)
        └── Adam (RMSProp + Momentum)
            └── AdamW (+ decoupled weight decay)

Each step adds one insight to address a specific failure mode of the version before it.
The best way to build intuition is to watch these algorithms run on real loss surfaces.
Open the Gradient Descent Visualizer →