ICML 2015

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Ioffe, Szegedy

TL;DR
Normalize each layer's activations over the mini-batch, then rescale with learned parameters. Dramatically speeds up training and reduces sensitivity to initialization.

What it says

The authors argue that the distribution of each layer’s inputs shifts during training (“internal covariate shift”) and slows down learning. Their fix: for each mini-batch, normalize each feature to zero mean and unit variance, then apply a learned scale and shift. At inference time, running averages from training are used. They show faster convergence, higher learning rates, and a mild regularization effect on ImageNet and other benchmarks.
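The procedure described above can be sketched in a few lines of NumPy. This is an illustrative forward pass only (no backward pass), with hypothetical names; `gamma` and `beta` are the learned scale and shift, and the running averages are what inference falls back on.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    """Sketch of batch norm over a (N, D) mini-batch: normalize each
    feature to zero mean / unit variance, then scale and shift."""
    if training:
        # Per-feature statistics computed over the mini-batch (axis 0).
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Running averages accumulated for use at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Inference: reuse the statistics collected during training.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta, running_mean, running_var
```

With `gamma = 1` and `beta = 0`, the training-mode output of each feature has (near-)zero mean and unit variance over the batch; the learned parameters let the network undo the normalization if that helps.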

Why it matters

BatchNorm became standard in CNNs for years and remains the default in most CV architectures. The “internal covariate shift” framing has since been challenged (the real mechanism is closer to smoothing the loss landscape), but the empirical result held. It also inspired the whole normalization family — LayerNorm, GroupNorm, InstanceNorm, RMSNorm — that keeps modern transformers trainable.

  • Layer Normalization (Ba et al., 2016) — the variant that powers transformers.
  • Group Normalization (Wu & He, 2018) — stable at small batch sizes.
  • How Does Batch Normalization Help Optimization? (Santurkar et al., 2018) — the loss-landscape explanation.
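The variants above differ mainly in which axis the statistics are taken over. A minimal sketch of that distinction (learned scale and shift omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # BatchNorm: statistics per feature, computed across the batch (axis 0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # LayerNorm: statistics per example, computed across the features
    # (axis 1), so it needs no batch statistics at all.
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

Because LayerNorm's statistics are per-example, it behaves identically at any batch size, which is one reason it displaced BatchNorm in transformers.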