Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy (2015)
What it says
The authors argue that the distribution of each layer’s inputs shifts during training as the parameters of earlier layers change (“internal covariate shift”), and that this slows down learning. Their fix: for each mini-batch, normalize each feature to zero mean and unit variance, then apply a learned scale and shift so the layer can still represent the identity transform. At inference time, running averages of the batch statistics accumulated during training replace the per-batch ones. They report faster convergence, tolerance for higher learning rates, and a mild regularization effect on ImageNet and other benchmarks.
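The mechanics above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's reference implementation: `batchnorm_forward` is a hypothetical name, and the momentum value for the running averages is an assumed convention.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """Batch normalization over the batch axis of an (N, D) input.

    gamma/beta are the learned scale and shift; running_mean/running_var
    are the averages used at inference time.
    """
    if train:
        mu = x.mean(axis=0)   # per-feature mean over the mini-batch
        var = x.var(axis=0)   # per-feature variance over the mini-batch
        x_hat = (x - mu) / np.sqrt(var + eps)
        # Accumulate running statistics for inference.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Inference: use the running averages instead of batch statistics.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta, running_mean, running_var
```

With `gamma = 1` and `beta = 0`, the training-mode output of each feature has (near) zero mean and unit variance regardless of the input's scale.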
Why it matters
BatchNorm became standard in CNNs for years and remains the default in most CV architectures. The “internal covariate shift” framing has since been challenged (the real mechanism is closer to smoothing the loss landscape), but the empirical result held. It also inspired the whole normalization family — LayerNorm, GroupNorm, InstanceNorm, RMSNorm — that keeps modern transformers trainable.
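The members of that family differ mainly in which axes the mean and variance are computed over. A rough numpy sketch for an `(N, C, H, W)` activation tensor (the group count of 2 is an arbitrary choice for illustration):

```python
import numpy as np

x = np.random.randn(4, 6, 8, 8)  # (batch N, channels C, height H, width W)
eps = 1e-5

def norm(t, axes):
    """Normalize t to zero mean / unit variance over the given axes."""
    mu = t.mean(axis=axes, keepdims=True)
    var = t.var(axis=axes, keepdims=True)
    return (t - mu) / np.sqrt(var + eps)

# BatchNorm: statistics over batch + spatial axes; one (mean, var) per channel.
bn = norm(x, (0, 2, 3))

# LayerNorm: statistics over all features of each individual sample.
ln = norm(x, (1, 2, 3))

# InstanceNorm: per sample and per channel, over spatial axes only.
inorm = norm(x, (2, 3))

# GroupNorm: split channels into groups (here 2 groups of 3 channels),
# then normalize each group within each sample.
gn = norm(x.reshape(4, 2, 3, 8, 8), (2, 3, 4)).reshape(4, 6, 8, 8)
```

BatchNorm is the only one of the four whose statistics couple samples in the batch, which is why the batch-independent variants became attractive for small batches and for transformers.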
Read next
- Layer Normalization (Ba et al., 2016) — the variant that powers transformers.
- Group Normalization (Wu & He, 2018) — stable at small batch sizes.
- How Does Batch Normalization Help Optimization? (Santurkar et al., 2018) — the loss-landscape explanation.