Batch Normalization

BatchNorm · BN
Deep Learning

A layer that normalizes activations across a mini-batch so each feature has roughly zero mean and unit variance, then rescales them with learned parameters.


In one line

Normalize each feature across the mini-batch, then rescale with two learned parameters per feature so training is faster and less sensitive to initialization.

What it actually means

For each feature dimension, compute the mean and variance over the current mini-batch, subtract the mean, divide by the standard deviation, then multiply by a learned scale and add a learned bias. At training time you use the batch statistics; at inference you use running averages accumulated during training. The net effect is that each layer sees inputs with a stable distribution, which lets you use higher learning rates and reduces the importance of careful initialization.
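The training-time computation above can be written as a minimal NumPy sketch (the function name and shapes are illustrative, not a real library API):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BatchNorm over a (batch, features) array."""
    mean = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize; eps guards against divide-by-zero
    return gamma * x_hat + beta               # learned per-feature scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))   # batch of 32, 4 features
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
# With gamma=1, beta=0, each feature of `out` has ~zero mean and ~unit variance.
```

With the defaults gamma=1 and beta=0 this is pure normalization; during training the two parameters let the network undo or rescale it per feature if that helps the loss.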

Why it matters

BatchNorm was the trick that made very deep CNNs such as ResNet trainable without heroic tuning. It still matters for computer-vision models, but transformers mostly use LayerNorm instead, because BatchNorm degrades when batches are small and doesn't handle variable-length sequence data well.

Example

import torch.nn as nn

layer = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1, bias=False),  # bias is redundant: BN's shift absorbs it
    nn.BatchNorm2d(128),                           # normalize each of the 128 channels
    nn.ReLU(),
)

You’ll hear it when

  • Training CNNs on ImageNet-style data.
  • Debugging a model whose batch size dropped and accuracy tanked.
  • Comparing BatchNorm vs LayerNorm vs GroupNorm.
  • Exporting to ONNX and hitting eval-mode bugs.
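The train/eval mismatch behind those batch-size and export bugs comes from the running statistics. A minimal sketch with a hypothetical BatchNorm1d class (not the PyTorch API) shows why the same input gives different outputs in the two modes:

```python
import numpy as np

class BatchNorm1d:
    """Illustrative BatchNorm with running statistics; a sketch, not a real library class."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)   # accumulated during training
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving average of the batch statistics, used later at inference
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(1)
bn = BatchNorm1d(3)
x = rng.normal(5.0, 2.0, size=(16, 3))
y_train = bn(x)        # normalized with this batch's own statistics
bn.training = False
y_eval = bn(x)         # normalized with the (barely warmed-up) running averages
# y_train != y_eval: forgetting to switch modes before inference or export
# is the classic source of the bugs listed above.
```

The gap between the two outputs shrinks as the running averages converge over many training batches, which is also why a sudden drop in batch size makes the per-batch statistics noisy and hurts accuracy.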
