Vanishing Gradient

Deep Learning

Gradients shrink exponentially as they propagate back through many layers, so early layers barely update and the network fails to train.


In one line

Backprop multiplies a chain of small numbers; in a deep network that product goes to zero and early layers stop learning.

What it actually means

Backprop computes gradients by applying the chain rule layer by layer. If each layer’s local gradient is less than 1, the product shrinks exponentially with depth. Saturating activations (sigmoid, tanh) are a classic culprit: sigmoid’s derivative is at most 0.25, so a 20-layer sigmoid network has gradients around 10^-12 at the input, and tanh flattens out the same way for large inputs. RNNs suffer the same problem along the time axis. The fix is a mix of tricks: ReLU-family activations, residual connections (which provide a gradient shortcut), normalization layers (BatchNorm, LayerNorm), careful initialization (Xavier, He), and architectures like LSTMs that gate the flow.
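The shrinking chain-rule product, and why a residual shortcut rescues it, can be seen in a few lines of plain Python. This is a minimal sketch, not a real network: each "layer" is just a sigmoid with identity weight and no bias, and the residual variant is the hypothetical block y = x + sigmoid(x).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigma'(x) = sigma(x)(1 - sigma(x)) <= 0.25

def grad_plain_chain(x, depth):
    """Chain-rule product of local derivatives through `depth`
    stacked sigmoid layers (toy setup: weight 1, no bias)."""
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid_grad(x)  # multiply in this layer's local derivative
        x = sigmoid(x)           # forward to the next layer
    return grad

def grad_residual_chain(x, depth):
    """Same chain, but each layer is a residual block y = x + sigmoid(x),
    whose derivative 1 + sigma'(x) never drops below 1."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + sigmoid_grad(x)
        x = x + sigmoid(x)
    return grad

print(grad_plain_chain(0.0, 20))     # vanishingly small, below 1e-12
print(grad_residual_chain(0.0, 20))  # stays on the order of 1
```

The residual block's derivative is 1 + sigma'(x), so the product of twenty such factors can never fall below 1; that is the "gradient shortcut" in miniature.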

Why it matters

This is the reason deep networks were considered untrainable before 2012. Every piece of the modern deep-learning toolkit — ReLU, ResNets, BatchNorm, transformers with residual streams — exists in part to keep gradients alive. If you're training a custom architecture and the loss won't move, this is the first thing to check.

Example

σ'(x) = σ(x)(1 - σ(x)) ≤ 0.25
20 sigmoid layers → gradient product ≤ (0.25)^20 ≈ 1e-12
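The bound above is a one-liner to verify in plain Python:

```python
# Each of 20 sigmoid layers contributes a local derivative of at
# most 0.25 to the chain-rule product, so the product is bounded by:
bound = 0.25 ** 20
print(f"{bound:.1e}")  # prints 9.1e-13, i.e. about 1e-12
```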

You’ll hear it when

  • Debugging a deep custom network that won’t train.
  • Explaining why residual connections work.
  • Reading history of deep learning (“why RNNs lost”).
  • Choosing activations or normalization strategies.

Related terms