
How Neural Networks Bend Space

A neural network doesn't "learn patterns." It warps the space your data lives in until the answer becomes obvious. Here's what that means, geometrically.

Start With a Problem You Can See

Imagine two classes of points on a 2D plane — red and blue. You want to draw a boundary that separates them. If the reds are all on the left and blues on the right, a single straight line does the job. This is what a linear classifier does.

But most real problems don't look like that. The reds might be surrounded by blues. They might form spirals. No single straight line could ever separate them.

The key insight: If you can't draw a straight line through the data, change the data until you can. A neural network transforms the input space — stretching, rotating, folding — until a straight line does work.

One Neuron = One Line

A single neuron computes w · x + b. Geometrically, that's a straight line: the set of points where w · x + b = 0. The neuron asks: which side does this point fall on?

The weight vector w is perpendicular to the boundary. The bias b shifts it away from the origin. One neuron, one line. To separate anything more complex, you need more neurons.

Interactive: A single neuron draws a line
w·x + b > 0 → blue | w·x + b < 0 → red

The orange arrow is the weight vector — it points perpendicular to the decision boundary. The boundary is the line where w·x + b = 0. Drag the sliders to rotate, flip, and shift it.
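In code, the neuron above is one dot product and a sign check. A minimal NumPy sketch — the specific w and b here are arbitrary examples, not values from the demo:

```python
import numpy as np

# A hypothetical single neuron: classify 2D points by which side of a line they fall on.
w = np.array([1.0, -1.0])   # weight vector, perpendicular to the boundary
b = 0.5                     # bias shifts the boundary away from the origin

def neuron_side(x):
    """Return 'blue' if w·x + b > 0, else 'red'."""
    return "blue" if np.dot(w, x) + b > 0 else "red"

print(neuron_side(np.array([2.0, 0.0])))   # w·x + b = 2.5 > 0 → blue
print(neuron_side(np.array([0.0, 2.0])))   # w·x + b = -1.5 < 0 → red
```

Rotating the boundary means changing w; sliding it means changing b — exactly the two sliders in the widget.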

One Layer: Stretch, Shift, Fold

A full layer applies three operations in sequence:

1. Linear transformation (matrix multiply) — rotates, stretches, and shears the space. Points move, but straight lines stay straight.

2. Bias (add a vector) — shifts everything. Lets the network center the data where the next operation is most useful.

3. Activation function (the nonlinearity) — this is where the magic happens. ReLU sets all negative values to zero: f(x) = max(0, x). Geometrically, this folds space along an axis. Regions collapse onto each other.

Without the fold, depth is useless. A stack of linear transformations is just one big linear transformation. The activation function is what lets each layer do something the previous one couldn't.

Interactive: How a layer transforms space
Input space | After layer (rotate + ReLU)

Left: the original 2D grid, colored by quadrant. Right: after a rotation (linear transform) and ReLU. Toggle ReLU off to see the rotation alone — straight lines stay straight. Turn ReLU on and watch the grid fold: negative values collapse to zero, and the four quadrants crumple together.
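The "depth is useless without the fold" claim is easy to check numerically. A small sketch — the matrices here are arbitrary examples, not the transform used in the widget:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))   # first "layer", no activation
B = rng.standard_normal((2, 2))   # second "layer", no activation
x = np.array([1.0, -2.0])

# Two stacked linear maps collapse into one: B(Ax) == (BA)x for every x.
assert np.allclose(B @ (A @ x), (B @ A) @ x)

def layer(x, W, b):
    """One layer: linear transform, shift, then fold negatives to zero (ReLU)."""
    return np.maximum(0.0, W @ x + b)

# With the fold, linearity breaks: negating the input no longer negates the output,
# because the fold sends both sides of the crease to the same half-space.
b0 = np.zeros(2)
out_pos = layer(x, A, b0)
out_neg = layer(-x, A, b0)
print(np.allclose(out_neg, -out_pos))   # False — the fold broke linearity
```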

Why Depth Matters: Folds on Folds

One fold isn't very interesting — you can crease a piece of paper once without creating much complexity. But fold it again and again, and the number of distinct regions grows exponentially.

A network with n ReLU neurons in one hidden layer can carve d-dimensional input space into up to O(n^d) linear regions. Two layers with n neurons each: O(n^{2d}). The network's expressive power grows exponentially with depth, not just width.
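You can count these regions directly: every linear region of a one-layer ReLU network corresponds to a distinct on/off pattern of its neurons, so sampling points and counting unique patterns estimates the region count. A sketch with illustrative sizes (8 neurons, 2D input — neither number comes from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 8
W = rng.standard_normal((n_neurons, 2))
b = rng.standard_normal(n_neurons)

# Sample many points; each point's boolean activation pattern identifies its region.
pts = rng.uniform(-5, 5, size=(100_000, 2))
patterns = pts @ W.T + b > 0
n_regions = len({tuple(p) for p in patterns})
print(n_regions)   # 8 lines split the plane into at most 1 + 8 + C(8,2) = 37 regions
```

The bound printed in the comment is the classic hyperplane-arrangement count: n hyperplanes in d dimensions create at most sum of C(n, i) for i = 0..d regions.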

In practice, each layer builds on the distortions of the previous one:

  • Layer 1 might detect edges — sharp transitions in pixel values
  • Layer 2 combines edges into textures and corners
  • Layer 3 assembles those into parts — eyes, wheels, letters
  • Layer 4 recognizes objects from their parts

Each layer's job is simple: make the next layer's job easier.

Interactive: How depth untangles data

Inner circle = blue class, outer ring = red class. No straight line can separate them. With 1 hidden layer the boundary is rough. With 2–3 layers the network learns to approximate the circular decision boundary by composing folds.

The background color shows the network's confidence: blue = predicts inner class, red = predicts outer class. Watch how the boundary sharpens with depth.
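For readers without the widget, roughly the same experiment runs in a few lines of NumPy: a tiny 2→16→1 ReLU network trained with plain gradient descent on concentric-circles data. All sizes, radii, step counts, and the learning rate are illustrative choices, not values from the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: inner circle (label 0) vs outer ring (label 1), radially separable.
n = 400
r = np.concatenate([rng.uniform(0, 1, n // 2), rng.uniform(2, 3, n // 2)])
theta = rng.uniform(0, 2 * np.pi, n)
X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
y = (r > 1.5).astype(float)

# 2 -> 16 -> 1 network: one hidden fold is enough for this data.
W1 = rng.standard_normal((2, 16)) * 0.5; b1 = np.zeros(16)
W2 = rng.standard_normal((16, 1)) * 0.5; b2 = np.zeros(1)

lr = 0.5
for step in range(3000):
    h = np.maximum(0.0, X @ W1 + b1)          # stretch, shift, fold
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # sigmoid output probability
    # Gradients of mean binary cross-entropy, by hand.
    dz2 = (p - y[:, None]) / n
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T * (h > 0)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

acc = ((p > 0.5).ravel() == y).mean()
print(f"train accuracy: {acc:.2f}")
```

No straight line achieves better than chance here, yet 16 folds composed into one hidden layer wrap a closed boundary around the inner circle.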

What Learning Looks Like

Training adjusts the weights — the parameters that control how each layer stretches, shifts, and folds. At initialization, the transformations are random and the data stays tangled. As training progresses, the layers learn to untangle it.

If you extract the hidden activations at each layer and plot them, you can literally watch the clusters form:

Interactive: Representations at each layer

Drag to rotate the 3D view. Press Play to watch the data morph through a 2→3→3→1 network. The extra dimension lets the network lift one class above the other — untangling the concentric circles into cleanly separable clusters.

The network has physically moved the data points away from each other in representation space. That's what learning is.
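A sketch of how those per-layer representations could be extracted — a hypothetical untrained 2→3→3→1 network (random weights, shapes matching the demo) whose forward pass keeps every intermediate activation for plotting:

```python
import numpy as np

rng = np.random.default_rng(42)
sizes = [2, 3, 3, 1]   # 2D input lifted to 3D twice, then squashed to 1 output
params = [(rng.standard_normal((m, n)) * 0.5, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward_with_activations(x):
    """Forward pass that records the representation after every layer."""
    activations = [x]
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:      # no fold on the output layer
            x = np.maximum(0.0, x)
        activations.append(x)
    return activations

acts = forward_with_activations(np.array([0.5, -1.0]))
print([a.shape for a in acts])   # [(2,), (3,), (3,), (1,)]
```

Run every data point through this and scatter-plot each stage: the intermediate (3,) representations are exactly the 3D clouds the widget rotates.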

The Manifold Hypothesis

A 256×256 color image has 196,608 dimensions. But the set of "real photographs" is an incomprehensibly thin slice of that space — almost all points are random noise.

The manifold hypothesis says real data lies on or near a low-dimensional surface embedded in the high-dimensional space. Faces form a manifold. Sentences form a manifold. Speech forms a manifold.

A neural network's job is to learn the shape of that manifold and flatten it out. Take the tangled, curved surface and stretch it until the features you care about become straight coordinates. The space-bending metaphor isn't just an analogy — it's literally what the math is doing.

What This Explains

The geometric view reframes a lot of what you encounter in practice:

  • Overfitting — the network bends space too aggressively, carving out regions so specific to training data that they don't generalize.
  • Regularization (dropout, weight decay) — limits how hard the network can bend. Smaller weights = gentler transformations = smoother boundaries.
  • Transfer learning — the early folds are generic. Pixel → edge → texture transformations are roughly the same whether you're classifying dogs or tumors. You only redo the last folds.
  • Adversarial examples — the folding isn't perfect. A tiny perturbation can push a point across a fold boundary, invisible to us but catastrophic for the network.
  • Embeddings — just intermediate representations. "Semantic similarity" means the network learned a transformation where similar inputs land near each other.

The optimizer series tells you how to navigate the loss surface. This tells you what the loss surface is: a measure of how well your space-bending machine has learned to straighten out the world.

Play with a full neural network and watch it learn decision boundaries in real time.

Open the Neural Network Playground →