Dropout
A regularization trick that randomly zeros out a fraction of activations during training so the network can't depend on any single neuron.
In one line
Randomly zero out a fraction of activations on each training step so the network learns redundant pathways.
What it actually means
At training time, each activation in a dropout layer is independently set to zero with probability p (typically 0.1–0.5) on every forward pass. The surviving activations are scaled up by 1/(1-p) so the expected value of the layer's output is unchanged (this "inverted dropout" is the variant modern frameworks implement). At inference time, dropout is disabled and all neurons fire. The effect is like training an ensemble of many sub-networks that share weights: no single neuron can become load-bearing, because it may be dropped on any given step.
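The mechanics above can be sketched in a few lines of from-scratch PyTorch (a minimal illustration, not the library's internal implementation; the function name is ours):

```python
import torch

def dropout(x, p=0.5, training=True):
    # At inference time, dropout is the identity.
    if not training or p == 0.0:
        return x
    # Bernoulli mask: each element survives with probability 1 - p.
    mask = (torch.rand_like(x) > p).float()
    # Scale survivors by 1/(1-p) so the expected value is unchanged.
    return x * mask / (1 - p)

x = torch.ones(10000)
y = dropout(x, p=0.3)
# Roughly 30% of entries are zero, but y.mean() stays near 1.0
# thanks to the 1/(1-p) rescaling.
```

Without the rescaling you would instead have to multiply activations by (1-p) at inference time, which is the original paper's formulation; frameworks prefer inverted dropout so the inference path stays untouched.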
Why it matters
Dropout was the regularization workhorse of the 2014–2018 era. It's less critical now that we have BatchNorm, LayerNorm, weight decay, and much larger datasets, but you'll still see it in transformer attention (attn_dropout) and FFN blocks. Forgetting to disable it at inference is a common bug: always call model.eval() before evaluation.
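The train/eval footgun is easy to demonstrate directly (a small sketch using nn.Dropout on its own):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.randn(512)

drop.eval()    # inference mode: dropout is a no-op
assert torch.equal(drop(x), x)  # output is exactly the input

drop.train()   # training mode: a fresh random mask every call
y1, y2 = drop(x), drop(x)
# y1 and y2 almost surely differ, since each forward pass
# samples its own mask; this is why an un-eval()'d model gives
# noisy, degraded metrics at evaluation time.
```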
Example
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(768, 3072),   # expand
    nn.GELU(),
    nn.Dropout(0.1),        # zero 10% of activations during training
    nn.Linear(3072, 768),   # project back
    nn.Dropout(0.1),
)
You’ll hear it when
- Reading transformer code (dropout=0.1 everywhere).
- Debugging a model that behaves differently in train vs eval.
- Tuning regularization strength.
- Reviewing the PyTorch .train()/.eval() footgun.