arXiv 2015, CVPR 2016

Deep Residual Learning for Image Recognition

He, Zhang, Ren, Sun

TL;DR
Adds identity "skip connections" across layers so gradients flow through very deep networks. Made training 150+ layer CNNs practical and won ImageNet 2015.

What it says

Before ResNet, stacking more layers in a plain CNN eventually made accuracy worse — and since training error rose too, the failure was optimization, not overfitting (the "degradation" problem). The authors propose residual blocks: instead of learning a mapping H(x) directly, each block learns the residual F(x) = H(x) - x and adds x back at the output, so the block computes F(x) + x. That identity shortcut gives gradients a direct path through hundreds of layers. They train networks up to 152 layers on ImageNet and sweep first place in the 2015 ImageNet and COCO competitions across classification, detection, and segmentation.
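A minimal NumPy sketch of the idea (fully connected for brevity — the paper's actual blocks use convolutions and batch norm, and the weights here are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = relu(x @ w1) @ w2.

    The block learns only the residual F; the identity shortcut
    adds the input x back before the final nonlinearity.
    """
    f = relu(x @ w1) @ w2  # residual branch F(x)
    return relu(f + x)     # identity shortcut: gradient flows through "+ x" unchanged

# With zero weights the residual branch vanishes, so the block passes a
# nonnegative input straight through -- an extra block costs nothing:
x = np.array([1.0, 2.0, 3.0])
w1 = np.zeros((3, 3))
w2 = np.zeros((3, 3))
print(residual_block(x, w1, w2))  # [1. 2. 3.]
```

This is why depth stops hurting: a block whose residual branch stays near zero behaves as an identity, so a deeper network can always at least match a shallower one.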

Why it matters

Residual connections are now a mandatory ingredient in nearly every deep architecture, including transformers (the “residual stream” is a direct descendant). ResNet also made “just make it deeper” a reliable recipe for years, and its backbones are still widely used for transfer learning today.

Related

  • Highway Networks (Srivastava et al., 2015) — earlier skip connections, but gated rather than pure identity.
  • Identity Mappings in Deep Residual Networks (He et al., 2016) — the follow-up that moves batch norm and ReLU into the residual branch ("pre-activation") so the shortcut stays a clean identity path.
  • DenseNet (Huang et al., 2016) — takes skip connections to their logical extreme by connecting each layer to every later layer.