overfitting.io / learning / activations

Activation Functions

Without nonlinearity, a 100-layer network is no better than a single matrix multiply. Activation functions are the one ingredient that makes depth useful. Here's how each one works, what goes wrong, and why the field keeps inventing new ones.

Why Nonlinearity Matters

A neural network layer computes z = Wx + b, then applies an activation function a = f(z). Without f, stacking two layers gives you W₂(W₁x + b₁) + b₂ = W₂W₁x + (W₂b₁ + b₂). That's just another linear transformation. A 100-layer linear network collapses to a single matrix multiply.

The activation function breaks this collapse. It introduces a nonlinearity between layers, so each layer can do something the previous one couldn't. What that nonlinearity looks like turns out to matter a lot.
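The collapse is easy to verify numerically. Here's a minimal numpy sketch (the shapes and random weights are arbitrary, chosen just for illustration) showing that two stacked linear layers are exactly equivalent to one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between (arbitrary sizes).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map as a single layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

Any nonlinear f between the two layers breaks this factorization, which is the whole point.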

The Zoo

There are dozens of activation functions in the literature, but in practice you'll encounter six. Each one has a different shape, different derivative, and different failure modes.

Interactive: Activation function explorer
[Plot of f(x) and f'(x) for the selected activation]
ReLU(x) = max(0, x)
ReLU'(x) = 1 if x > 0, 0 if x ≤ 0

Click each activation to see its shape and derivative. The derivative matters because it controls how gradients flow backward during training. A derivative near zero means gradients vanish; a constant derivative (like ReLU's) means gradients pass through unchanged.

Sigmoid

The sigmoid was the original activation function, borrowed from logistic regression. It squashes any input into (0, 1), which has a nice probabilistic interpretation. Historically, this made it popular for output layers in binary classification.

The problem is in the derivative. The maximum value of σ'(x) is 0.25, at x = 0. By x = ±5 it has already dropped below 0.01, and it keeps decaying exponentially. This means that during backpropagation, gradients get multiplied by a number less than 0.25 at every layer. In a 10-layer network, the gradient reaching the first layer is at most 0.25¹⁰ ≈ 10⁻⁶ of what it was at the output. The early layers stop learning. This is the vanishing gradient problem.
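You can check these numbers directly. A short sketch of σ'(x) = σ(x)(1 − σ(x)) and the best-case 10-layer shrinkage:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_deriv(0.0))  # 0.25, the maximum
print(sigmoid_deriv(5.0))  # ~0.0066, already tiny

# Best case: every pre-activation sits at 0, where the derivative is 0.25.
# Through 10 layers the gradient shrinks by at least this factor:
print(0.25 ** 10)  # ~9.5e-07
```

In a real network the shrinkage is usually worse, since pre-activations rarely sit exactly at zero.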

Tanh

Tanh is a shifted, scaled sigmoid: tanh(x) = 2σ(2x) − 1. Its output is zero-centered ((−1, 1) instead of (0, 1)), which helps because gradients don't all push in the same direction. Its derivative peaks at 1.0 instead of 0.25, so vanishing gradients are less severe. But it still saturates for large inputs. In practice, tanh dominated the 2000s but is now mostly used in recurrent networks (LSTMs, GRUs) where its bounded output helps keep hidden states stable.

ReLU

ReLU changed everything. ReLU(x) = max(0, x). For positive inputs the derivative is exactly 1. No shrinking, no saturation, no vanishing gradient. Gradients pass through unchanged. For negative inputs the output and derivative are both 0.

This simplicity is a feature. ReLU is computationally cheap (one comparison), creates sparse activations (many neurons output exactly zero), and trains deep networks that sigmoid/tanh simply could not.

But the zero region introduces its own problem: dying neurons. If a neuron's input becomes permanently negative (say, after a bad gradient update), its output is always 0, its derivative is always 0, and it never recovers. The neuron is dead. In practice, a noticeable fraction of neurons in a large ReLU network can die during training, permanently reducing the network's capacity.
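A small sketch makes the death mechanism concrete: once a neuron's pre-activations are negative for every input, both the forward signal and the backward gradient are zero, so nothing can pull it back.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 where x > 0, else 0.
    return (x > 0).astype(float)

# A "dead" neuron: its pre-activation is negative for every input it sees
# (hypothetical values for illustration).
pre_activations = np.array([-3.2, -0.7, -1.5, -4.1])

print(relu(pre_activations))       # all zeros: no signal forward
print(relu_grad(pre_activations))  # all zeros: no gradient backward
# With zero gradient, the weights feeding this neuron never update,
# so the pre-activation stays negative and the neuron never recovers.
```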

Interactive: Gradient magnitude through layers
[Bar chart: gradient magnitude at layers L1–L6 and output, gradient flowing backward; toggle Sigmoid / ReLU / GELU]

Bar height = gradient magnitude at each layer. Gradients flow backward from output (right) to input (left). Sigmoid's derivative maxes out at 0.25, so gradients shrink by at least 4× per layer. By layer 6, the first layers barely learn. ReLU passes gradients through at full strength (derivative = 1 for positive inputs).

Leaky ReLU

The fix for dying neurons is straightforward: instead of outputting exactly 0 for negative inputs, output a small fraction of the input. LeakyReLU(x) = max(αx, x) with a small α (typically 0.01). Now the derivative is never exactly zero. Dead neurons can recover.

Parametric ReLU (PReLU) takes this further by making α a learnable parameter. The network decides how leaky each neuron should be.
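A minimal sketch of the leaky variant and its derivative, using the typical α = 0.01 from above:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x) for 0 < alpha < 1
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Derivative is 1 for positive inputs, alpha for negative ones:
    # small, but never exactly zero.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))       # [-0.03  -0.005  0.5  3.0]
print(leaky_relu_grad(x))  # [0.01  0.01  1.  1.]
```

PReLU is the same function with `alpha` promoted to a trainable parameter (often one per channel) updated by gradient descent along with the weights.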

GELU

ReLU makes a hard binary decision: if the input is positive, keep it; if negative, kill it. GELU replaces that hard gate with a soft, probabilistic one: GELU(x) = x · Φ(x), where Φ(x) is the CDF of the standard normal distribution.

The intuition: instead of asking "is this input positive?", GELU asks "how likely is this input to be positive, given that inputs are normally distributed?" Large positive values pass through almost unchanged. Large negative values are nearly zeroed. Values near zero are partially attenuated, proportional to how uncertain their sign is.

This smoothness matters. GELU has a continuous, smooth derivative everywhere (unlike ReLU's hard kink at zero). This gives the optimizer smoother loss surfaces to navigate. GELU is the standard activation in BERT, GPT, and most modern transformers.
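The exact form needs only the error function, since Φ(x) = ½(1 + erf(x/√2)). Here's a sketch of it alongside the tanh-based approximation that many frameworks use (the constant 0.044715 comes from the original GELU paper):

```python
import math

def gelu(x):
    # Exact form: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"{x:5.1f}  exact={gelu(x):8.4f}  tanh={gelu_tanh(x):8.4f}")
```

Note the behavior the text describes: gelu(3.0) ≈ 2.996 (nearly unchanged), gelu(−3.0) ≈ −0.004 (nearly zeroed), and values near zero are partially attenuated.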

Swish

Swish is Swish(x) = x · σ(x). It was discovered through automated architecture search at Google. It looks very similar to GELU (both are smooth, both allow small negative outputs, both converge to ReLU for large positive inputs). In practice, the two perform similarly. Swish is common in vision models (EfficientNet) while GELU dominates in language models.
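The similarity is easy to see side by side. A quick sketch comparing the two on a few sample points:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Both are smooth, dip slightly below zero for negative inputs,
# and approach the identity for large positive x.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0, 5.0):
    print(f"{x:5.1f}  swish={swish(x):8.4f}  gelu={gelu(x):8.4f}")
```

The main visible difference is that Swish's negative dip is a bit deeper and wider than GELU's (e.g. swish(−1) ≈ −0.269 vs. gelu(−1) ≈ −0.159).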

What ReLU Actually Computes

Here's something concrete that makes ReLU networks easier to reason about. A single-layer ReLU network with n hidden neurons computes a piecewise linear function. Each neuron contributes one "hinge" where the function can change slope. With n neurons, you get at most n + 1 linear pieces.

This means a ReLU network isn't doing anything mysterious. It's stitching together line segments (or, in higher dimensions, flat patches). More neurons = more segments = finer approximation. The universal approximation theorem guarantees that with enough neurons, this piecewise linear function can approximate any continuous function arbitrarily well.
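The sum-of-hinges view can be written out directly. In this sketch the weights are hand-picked (not trained) just to show where the kinks land: each hidden neuron wᵢ · relu(x + bᵢ) bends the function at x = −bᵢ, and between kinks the output is exactly linear.

```python
import numpy as np

# Hand-picked 1-hidden-layer ReLU network in one dimension
# (hypothetical weights, chosen so the kinks sit at x = 0, 1, 2).
w1 = np.array([1.0, 1.0, 1.0])    # input weights
b  = np.array([0.0, -1.0, -2.0])  # biases -> kinks at x = -b/w1
w2 = np.array([1.0, -2.0, 2.0])   # output weights

def net(x):
    # Sum of hinges: sum_i w2[i] * relu(w1[i] * x + b[i])
    return np.maximum(0.0, np.outer(x, w1) + b) @ w2

xs = np.array([-1.0, 0.5, 1.5, 3.0])
print(net(xs))  # one sample from each of the 4 linear pieces
```

Three neurons, four linear pieces: flat for x < 0, slope +1 on (0, 1), slope −1 on (1, 2), slope +1 beyond 2.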

Interactive: ReLU neurons as piecewise linear approximation
[Plot: dashed target function vs. trained single-layer ReLU network; neuron count n adjustable]

The dashed line is a smooth target function. The blue line is a trained single-layer ReLU network. Each neuron contributes one "kink" (faint vertical lines) where the approximation can change slope. With 1–2 neurons, the fit is crude. By 10–15, it tracks the curve closely. This is the universal approximation theorem in action.

Deep ReLU networks extend this. Each layer multiplies the number of possible linear regions. A network with L layers of n neurons can create up to O(n^L) distinct linear regions, exponentially more than a single layer. That's the computational advantage of depth.

Sharp vs Smooth Boundaries

ReLU's piecewise linearity means its decision boundaries have hard corners wherever a neuron switches on or off. GELU and Swish produce smoother boundaries because their transitions are gradual.

In practice, smoother boundaries tend to generalize better. Sharp corners can overfit to noise in the training data. This is one reason modern architectures have moved toward smooth activations for their hidden layers, even though ReLU remains competitive for many tasks.

ReLU vs GELU: Decision boundaries on the same data

Same data, same architecture, same initialization. ReLU produces a piecewise linear boundary with sharp corners. GELU produces smoother boundaries. In practice, smooth boundaries often generalize better to new data.

See how activation functions shape what networks can learn, live.

Open the Neural Network Playground →