Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (CVPR 2016)
What it says
Before ResNet, stacking more layers in a plain CNN eventually made even training accuracy worse: the degradation came from optimization difficulty, not overfitting. The authors propose residual blocks: instead of learning a target mapping H(x) directly, each block learns the residual F(x) = H(x) - x and adds x back at the end, so the output is F(x) + x. That identity shortcut lets gradients flow directly through hundreds of layers. They train networks up to 152 layers deep on ImageNet and take first place in the ILSVRC 2015 classification, detection, and localization tasks, as well as COCO detection and segmentation.
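The core idea fits in a few lines. This is a minimal NumPy sketch of a single residual block (not the paper's actual convolutional implementation, which uses conv layers and batch normalization): the learned weights parameterize F(x), and the shortcut adds x back before the final activation. The function names and shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    # The weights only have to learn the residual F(x) = H(x) - x;
    # the identity shortcut supplies the "+ x" for free.
    f = relu(x @ w1) @ w2   # F(x), here a tiny two-layer MLP
    return relu(f + x)      # H(x) = F(x) + x, then the final nonlinearity

d = 8
x = rng.standard_normal(d)
w1 = rng.standard_normal((d, d)) * 0.1
w2 = rng.standard_normal((d, d)) * 0.1

y = residual_block(x, w1, w2)
```

One consequence worth noticing: if the weights are all zero, the block reduces to `relu(x)`, i.e. the residual branch contributes nothing and the signal passes straight through. That is why very deep stacks of these blocks start out close to an identity mapping and remain easy to optimize.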
Why it matters
Residual connections are now a standard ingredient in nearly every deep architecture, including transformers (the "residual stream" is a direct descendant). ResNet also made "just make it deeper" a reliable recipe for years, and its backbones are still widely used for transfer learning today.
Read next
- Highway Networks (Srivastava et al., 2015) — earlier gated skip connections.
- Identity Mappings in Deep Residual Networks (He et al., 2016) — the follow-up that refines where the skip goes.
- DenseNet (Huang et al., 2016) — takes skip connections to their logical extreme.