Regularization

Classical ML


In one line

Anything you add to training that discourages the model from fitting noise — usually a penalty on weight magnitude or randomness injected into the forward pass.

What it actually means

Classical regularizers add a term to the loss: L2 (weight decay) penalizes the squared norm of the weights, while L1 pushes weights toward zero and induces sparsity. Modern deep learning regularizers also include dropout (randomly zero out activations during training), label smoothing (don’t trust your hard labels), and data augmentation (train on perturbed inputs). All of them trade a bit of training fit for better generalization.
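The two classical penalty terms can be sketched in a few lines of plain Python — function names here are illustrative, not from any library:

```python
# Sketch: the penalty terms added to the data loss.

def l2_penalty(weights, lam):
    """Weight decay: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """Lasso-style penalty: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

weights = [0.5, -1.0, 2.0]
print(l2_penalty(weights, 0.01))  # 0.01 * (0.25 + 1 + 4) = 0.0525
print(l1_penalty(weights, 0.01))  # 0.01 * (0.5 + 1 + 2) = 0.035
```

Note how L2 punishes large weights disproportionately (squaring), while L1 charges every nonzero weight the same marginal rate — which is why L1 drives small weights all the way to zero.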

Why it matters

Without regularization, large models overfit small datasets immediately. The right regularizer is task-specific: weight decay is the default for pretraining LLMs, dropout is everywhere in vision, augmentation is the difference between a usable and a useless image classifier on small data. Tuning regularization strength is one of the most common knobs in any ML config.

Example

L_total = L_data + λ * Σ w_i^2     # L2 weight decay

With λ = 0.01, every update step shrinks the weights toward zero in addition to following the data gradient.
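As a minimal sketch of how that shrinkage shows up in the update rule (names and hyperparameters here are illustrative): the gradient of λ·w² is 2λw, so it simply gets added to the data gradient each step.

```python
# Sketch: one gradient step with L2 weight decay folded in.

def sgd_step(w, data_grad, lam=0.01, lr=0.1):
    total_grad = data_grad + 2 * lam * w  # data gradient + decay term
    return w - lr * total_grad

w = 1.0
w = sgd_step(w, data_grad=0.0)  # no data signal: pure shrinkage
print(w)  # 1.0 - 0.1 * (2 * 0.01 * 1.0) = 0.998
```

With zero data gradient the weight still decays geometrically toward zero, which is exactly the "in addition to the gradient signal" behavior described above.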

You’ll hear it when

  • Setting weight_decay in any optimizer.
  • Adding dropout layers to a transformer or CNN.
  • Discussing why your val loss is bad and someone says “try more regularization”.
  • Reading about L1/L2 in a linear or logistic regression context.
  • Configuring augmentation pipelines for vision models.
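The dropout mentioned above can be sketched as "inverted dropout" in plain Python (the function name is illustrative): each unit is zeroed with probability p during training, and survivors are scaled by 1/(1−p) so the expected activation is unchanged at inference.

```python
import random

# Sketch: inverted dropout on a list of activations.

def dropout(activations, p, training=True):
    if not training or p == 0.0:
        return list(activations)  # identity at inference time
    keep = 1.0 - p
    # Zero each unit with probability p; rescale the survivors.
    return [a / keep if random.random() >= p else 0.0
            for a in activations]

acts = [1.0, 2.0, 3.0, 4.0]
print(dropout(acts, p=0.5))                   # some zeroed, rest doubled
print(dropout(acts, p=0.5, training=False))   # unchanged
```

The rescaling is the key design choice: it keeps the layer's expected output the same in training and inference, so no extra correction is needed at test time.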

Related terms