
How Optimizers Work

A visual guide to the algorithms that train neural networks — from vanilla gradient descent to Adam.

Gradient Descent: The Baseline

Every optimizer starts from the same idea: compute the gradient of your loss function, then take a small step in the opposite direction.

w \leftarrow w - \alpha \cdot \nabla L(w)

\alpha is the learning rate — how big a step to take. Too large and you overshoot; too small and training never finishes.

In practice you use mini-batch SGD: compute the gradient on a random subset of data instead of the full dataset. This is noisy but fast, and the noise often helps escape shallow local minima.
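The update rule can be sketched in a few lines of plain Python. This is a minimal toy, not a training loop: the quadratic loss L(w) = (w − 3)², its gradient, and the names `grad_L` and `sgd` are all illustrative assumptions, not from the article.

```python
# Toy 1D loss L(w) = (w - 3)^2, minimised at w = 3.
def grad_L(w):
    return 2.0 * (w - 3.0)  # analytic gradient of the toy loss

def sgd(w, alpha=0.1, steps=100):
    for _ in range(steps):
        w = w - alpha * grad_L(w)  # step opposite the gradient
    return w

w_final = sgd(w=0.0)  # converges close to 3.0
```

On a real model, `grad_L` would come from backpropagation on a mini-batch rather than an analytic formula.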

The core problem: Loss surfaces are rarely smooth bowls. They have narrow ravines, saddle points, and regions where the gradient in one direction is 100× larger than another. A single fixed learning rate can't handle all of this well.
Interactive: SGD on a 1D loss surface

Try a high learning rate (≥ 0.6) — notice how it overshoots and oscillates.

Momentum: Learning from Direction

Imagine rolling a ball down a hilly landscape. It doesn't stop and recalculate at every point — it carries velocity from its previous motion. Momentum does the same.

v \leftarrow \beta \cdot v + (1 - \beta) \cdot \nabla L(w)
w \leftarrow w - \alpha \cdot v

v is the velocity — a running average of past gradients. When gradients consistently point the same direction, velocity builds and the optimizer accelerates. When gradients flip (noise), they cancel out.

This is especially powerful in ravines: plain SGD oscillates back and forth across the narrow walls, barely moving forward. Momentum dampens the oscillations and accelerates along the ravine floor. \beta = 0.9 is the standard default.
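The two-line update maps directly to code. A minimal sketch on the same kind of toy 1D quadratic as before (the loss and function names are illustrative assumptions):

```python
# Momentum on the toy loss L(w) = (w - 3)^2.
def grad_L(w):
    return 2.0 * (w - 3.0)

def momentum(w, alpha=0.1, beta=0.9, steps=500):
    v = 0.0  # velocity starts at rest
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_L(w)  # EMA of past gradients
        w = w - alpha * v                      # step along the velocity
    return w

w_final = momentum(w=0.0)  # converges close to 3.0
```

Setting `beta=0` makes the velocity equal the current gradient, recovering plain SGD — the same reduction the interactive demo below lets you try.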

Interactive: SGD vs Momentum in a ravine

Both optimizers use the same learning rate. SGD zigzags across the ravine walls; Momentum dampens oscillations and reaches the minimum faster. Try β = 0 to reduce Momentum to plain SGD.

Adaptive Rates: One Size Does Not Fit All

Momentum smooths direction. But every parameter still shares the same learning rate. Consider a word embedding with 50,000 vocabulary entries: on any mini-batch, the common word "the" gets a gradient update — but the rare word "kerfuffle" might not fire for thousands of steps. When it finally does, you want a meaningful update, not a timid one calibrated for "the".

RMSProp

Track a moving average of each parameter's squared gradient, then divide the learning rate by the square root of that average.

v \leftarrow \beta \cdot v + (1-\beta) \cdot (\nabla L)^2
w \leftarrow w - \frac{\alpha}{\sqrt{v + \varepsilon}} \cdot \nabla L

Parameters in steep noisy regions (large v) get smaller steps. Parameters in flat consistent regions (small v) get larger steps. It's a cheap approximation of per-direction curvature without ever computing the expensive Hessian matrix.
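A sketch of the per-parameter scaling on a toy ill-conditioned surface. The loss L = 0.01·w₁² + w₂² (gradients differing by 100×) is an illustrative assumption chosen to mimic the demo below:

```python
# RMSProp on an ill-conditioned 2D loss: L = 0.01*w1^2 + 1.0*w2^2.
def rmsprop(w, alpha=0.01, beta=0.9, eps=1e-8, steps=2000):
    v = [0.0, 0.0]
    curv = [0.01, 1.0]  # per-parameter curvature: w1 flat, w2 steep
    for _ in range(steps):
        g = [2 * c * wi for c, wi in zip(curv, w)]
        # EMA of squared gradients, per parameter
        v = [beta * vi + (1 - beta) * gi**2 for vi, gi in zip(v, g)]
        # each parameter's step is scaled by its own gradient history
        w = [wi - alpha * gi / (vi + eps) ** 0.5
             for wi, gi, vi in zip(w, g, v)]
    return w

w_final = rmsprop([1.0, 1.0])  # both coordinates reach the minimum
```

Despite the 100× gradient imbalance, both coordinates take steps of roughly the same size, because each gradient is divided by the square root of its own running average.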

Interactive: RMSProp vs SGD on an ill-conditioned surface

w₁ is flat (tiny gradients); w₂ is steep (large gradients). SGD is stuck: any lr safe for w₂ leaves w₁ crawling — the step size readout shows SGD using the same value for both.

RMSProp automatically scales each parameter's step by its gradient history. Watch the readout: it quickly gives w₁ a much larger step than w₂.

Try raising SGD lr above 0.25 — w₂ starts to diverge.

Adam

Adam combines momentum with RMSProp. It maintains two running averages:

m \leftarrow \beta_1 m + (1-\beta_1)\nabla L \quad \text{(direction)}
v \leftarrow \beta_2 v + (1-\beta_2)(\nabla L)^2 \quad \text{(scale)}
w \leftarrow w - \alpha \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \varepsilon}

\hat{m} and \hat{v} are bias-corrected: both running averages are initialised at zero, so early steps would be too small without the corrections \hat{m} = m/(1-\beta_1^t) and \hat{v} = v/(1-\beta_2^t). The \beta_1 correction is essentially 1 by step 50; the \beta_2 correction, with \beta_2 = 0.999, takes a few thousand steps to fade.

The key insight: when gradients are consistent, \hat{m}/\sqrt{\hat{v}} \approx \pm 1 regardless of gradient magnitude. Adam's effective step is approximately \alpha — it self-normalises. This is why a single learning rate works across parameters with wildly different gradient scales, and why Adam is far less sensitive to hyperparameter tuning than SGD.

Adam's defaults are unusually robust: \beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}, \alpha = 0.001. With plain SGD you often need careful learning rate tuning just to get training started. Adam usually reaches a reasonable result out of the box.
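The full update, with bias correction, fits in a few lines. A minimal sketch using the defaults above on the same toy ill-conditioned loss L = 0.01·w₁² + w₂² (the loss and names are illustrative assumptions):

```python
# Adam on the toy 2D loss L = 0.01*w1^2 + 1.0*w2^2.
def adam(w, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8, steps=5000):
    m = [0.0, 0.0]  # first moment (direction)
    v = [0.0, 0.0]  # second moment (scale)
    curv = [0.01, 1.0]
    for t in range(1, steps + 1):  # t starts at 1 for bias correction
        g = [2 * c * wi for c, wi in zip(curv, w)]
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi**2 for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1**t) for mi in m]  # bias-corrected moments
        v_hat = [vi / (1 - b2**t) for vi in v]
        w = [wi - alpha * mh / (vh**0.5 + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w

w_final = adam([1.0, 1.0])  # both coordinates converge
```

Note the self-normalisation at work: on the very first step, m̂/√v̂ equals g/|g| = ±1 for each parameter, so both coordinates move by roughly α despite the 100× gradient imbalance.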
Interactive: Adam normalises step size across gradient scales

Gradient magnitudes shrink during training (the loss is converging). SGD's step (orange) follows the raw gradient — large and noisy early, tiny late.

Adam (blue) normalises: m̂/√v̂ ≈ ±1 regardless of gradient scale, so the effective step stays near α (dashed line). Crank up the noise to see Adam's momentum term smooth out the jitter.

AdamW

Adam's per-parameter scaling distorts L2 regularisation applied through the gradient. AdamW decouples weight decay from the gradient update:

w \leftarrow w - \alpha \cdot \frac{\hat{m}}{\sqrt{\hat{v}}+\varepsilon} - \alpha \lambda w

The decay term \alpha \lambda w is now independent of the adaptive scaling. AdamW is the default for transformers and large language models.
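The decoupling is visible in code: the decay is applied directly to the weight, outside the adaptively scaled step. A minimal single-step sketch (function name and the toy state-passing convention are illustrative assumptions):

```python
# One AdamW step: the Adam update plus decoupled weight decay.
def adamw_step(w, g, m, v, t, alpha=0.001, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1**t)  # bias correction, as in Adam
    v_hat = v / (1 - b2**t)
    w = w - alpha * m_hat / (v_hat**0.5 + eps)  # adaptive step
    w = w - alpha * wd * w                      # decoupled decay
    return w, m, v
```

Folding λw into the gradient instead would send it through the 1/√v̂ scaling, so rarely-updated parameters (large effective steps) would be decayed more aggressively than frequently-updated ones — exactly the distortion AdamW removes.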

Putting It All Together

You've seen each optimizer fix a specific weakness of the one before it. Now let's watch them all race on the same surface — same starting point, same learning rate — and see how those fixes compound.

Interactive: The optimizer race

All four optimizers start at the same point with the same learning rate. Watch how each one handles the ill-conditioned surface differently.

Try lr = 0.10 — SGD barely moves along w₁ while Adam converges smoothly. Then try lr = 0.45 — SGD diverges in w₂ while Adam stays stable.

The Optimizer Family Tree

SGD
├── + Momentum (gradient direction smoothing)
│   └── + Nesterov (look-ahead gradient)
└── Adagrad (per-parameter adaptive LR)
    └── RMSProp (EMA instead of running sum)
        └── Adam (RMSProp + Momentum)
            └── AdamW (+ decoupled weight decay)

Each step adds one insight to address a specific failure mode of the version before it.

The best way to build intuition is to watch these algorithms run on real loss surfaces.

Open the Gradient Descent Visualizer →