Backpropagation

backprop
Deep Learning

In one line

The algorithm that computes gradients of the loss with respect to every weight in a network by applying the chain rule backwards through the computation graph.

What it actually means

When you do a forward pass, every operation gets recorded into a graph. Backprop walks that graph in reverse, multiplying local derivatives layer by layer to figure out how a small change in each weight would change the loss. Modern frameworks (PyTorch, JAX) build the graph dynamically and call backward for you — you almost never write the math yourself. The expensive part is the memory: you have to keep activations from the forward pass around so the backward pass can use them, which is why training memory dwarfs inference memory.
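The chain-rule walk can be seen in a tiny hypothetical graph: two recorded operations, each with a simple local derivative, multiplied together in reverse order. The variable names here are illustrative, not from any particular framework convention.

```python
import torch

# Hypothetical two-op graph: z = w * x, then loss = z ** 2.
x = torch.tensor(3.0)
w = torch.tensor(2.0, requires_grad=True)

z = w * x          # local derivative: dz/dw = x
loss = z ** 2      # local derivative: dloss/dz = 2 * z

loss.backward()    # walks the graph in reverse, chaining the two

# Chain rule by hand: dloss/dw = (2 * z) * x = (2 * 6) * 3 = 36
manual = 2 * (w.item() * 3.0) * 3.0
print(w.grad)      # tensor(36.)
```

Note that computing dloss/dz needs the value of z from the forward pass — that saved activation is exactly the memory cost described above.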

Why it matters

Backprop is what makes deep learning trainable at all. Without an efficient way to compute gradients for millions or billions of parameters, you’d be doing finite differences and waiting until the heat death of the universe. Every optimizer, every fine-tuning trick, every gradient checkpointing scheme assumes backprop underneath.
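To make the cost contrast concrete, here is a minimal sketch comparing the two approaches on a toy loss: backprop delivers every gradient in one backward pass, while finite differences needs an extra forward pass per parameter (the loop below is the part that would not scale to billions of weights). Double precision is used only to keep the numerical comparison clean.

```python
import torch

def loss_fn(w):
    return (w ** 3).sum()

w = torch.tensor([1.0, 2.0], dtype=torch.float64, requires_grad=True)
loss_fn(w).backward()  # one backward pass: all gradients at once

# Finite differences: one extra forward pass *per parameter*
eps = 1e-6
fd = torch.zeros(2, dtype=torch.float64)
for i in range(2):
    bumped = w.detach().clone()
    bumped[i] += eps
    fd[i] = (loss_fn(bumped) - loss_fn(w.detach())) / eps

print(w.grad)  # analytic gradient: 3 * w**2 = [3, 12]
print(fd)      # numerically close, but cost scales with parameter count
```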

Example

import torch
x = torch.tensor([2.0], requires_grad=True)
y = (x ** 3 + 2 * x).sum()  # forward
y.backward()                  # backward
print(x.grad)                 # 3*x^2 + 2 = 14

You’ll hear it when

  • Debugging exploding or vanishing gradients.
  • Profiling training memory and someone mentions activation checkpointing.
  • Implementing a custom autograd op.
  • Reading about LoRA or other parameter-efficient fine-tuning methods.
  • Explaining why training is much more expensive than inference.
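The activation-checkpointing bullet above has a direct API counterpart in PyTorch: torch.utils.checkpoint trades memory for compute by discarding a block's activations after the forward pass and recomputing them during backward. A minimal sketch, with an arbitrary toy layer:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy block whose intermediate activations we choose not to store.
layer = torch.nn.Sequential(
    torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 16)
)
x = torch.randn(4, 16, requires_grad=True)

# Activations inside `layer` are recomputed during backward instead of saved.
# use_reentrant=False is the recommended modern variant.
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # same gradients as an ordinary forward, less stored memory
```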

Related terms