Regularization
In one line
Anything you add to training that discourages the model from fitting noise — usually a penalty on weight magnitude or randomness injected into the forward pass.
What it actually means
Classical regularizers add a term to the loss: L2 (weight decay) penalizes the squared norm of the weights, while L1 penalizes their absolute values, which drives some weights to exactly zero and induces sparsity. Modern deep learning adds dropout (randomly zeroing activations during training), label smoothing (don’t fully trust your hard labels), and data augmentation (training on perturbed inputs). All of them trade a bit of training fit for better generalization.
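The three most mechanical of these can be sketched in a few lines of plain Python. This is a toy illustration, not a training loop; the weight values, the penalty strength `lam`, and the drop probability `p` are all hypothetical:

```python
import random

random.seed(0)
w = [random.gauss(0, 1) for _ in range(5)]   # pretend these are model weights
lam = 0.01                                   # penalty strength (hypothetical value)

# L2 penalty: lam * sum of squared weights, added to the data loss
l2_penalty = lam * sum(wi ** 2 for wi in w)

# L1 penalty: lam * sum of absolute weights; its gradient is constant
# in magnitude, which is what pushes weights to exactly zero (sparsity)
l1_penalty = lam * sum(abs(wi) for wi in w)

# Inverted dropout: zero each activation with probability p during
# training, and rescale the survivors so the expected value is unchanged
p = 0.5
acts = [random.gauss(0, 1) for _ in range(5)]
dropped = [a / (1 - p) if random.random() >= p else 0.0 for a in acts]
```

At inference time dropout is simply turned off; the rescaling by `1 / (1 - p)` during training is what makes that consistent.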
Why it matters
Without regularization, large models overfit small datasets almost immediately. The right regularizer is task-specific: weight decay is the default for pretraining LLMs, dropout is everywhere in vision, and augmentation is the difference between a usable and a useless image classifier on small data. Regularization strength is one of the most commonly tuned knobs in any ML config.
Example
L_total = L_data + λ * Σ w_i^2 # L2 weight decay
With λ = 0.01, each update shrinks the weights toward zero on top of the usual gradient signal; larger λ means stronger shrinkage and a simpler model.
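To make the shrinkage concrete, here is one SGD step on a single weight. The learning rate and λ are hypothetical, and the data-loss gradient is set to zero so that only the penalty term acts:

```python
lr, lam = 0.1, 0.01   # hypothetical learning rate and penalty strength
w = 1.0
data_grad = 0.0       # suppose the data-loss gradient happens to be zero here

# The gradient of lam * w^2 is 2 * lam * w, so even with no data
# signal the update still shrinks the weight toward zero.
w = w - lr * (data_grad + 2 * lam * w)
print(w)  # 0.998
```

Every step multiplies the weight by (1 − 2·lr·λ), which is where the name "weight decay" comes from.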
You’ll hear it when
- Setting `weight_decay` in any optimizer.
- Adding dropout layers to a transformer or CNN.
- Discussing why your validation loss is bad and someone says “try more regularization”.
- Reading about L1/L2 in a linear or logistic regression context.
- Configuring augmentation pipelines for vision models.