Optimizer

Adam · SGD · AdamW
Deep Learning

The algorithm that updates model weights from their gradients — SGD, Adam, AdamW, Lion, and friends.


In one line

The thing that takes gradients and turns them into weight updates.

What it actually means

Given the gradients from backprop, an optimizer decides how to change the weights. Plain SGD: w ← w - lr * g. SGD with momentum: keep an exponential moving average of past gradients and step in that direction, which smooths out noisy updates. Adam: track first and second moments of the gradient per parameter and normalize the step by them, giving an adaptive per-parameter step size. AdamW: Adam with weight decay applied directly to the weights (decoupled) rather than mixed into the gradient, which fixes a long-standing subtlety where L2 regularization got rescaled by Adam's adaptive denominator. Lion and Shampoo are newer options that trade memory and compute in different directions.
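The update rules above can be sketched in a few lines of NumPy. This is an illustrative toy version, not the PyTorch internals; the function names and default hyperparameters are ours.

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    """Plain SGD: step against the gradient."""
    return w - lr * g

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: per-parameter adaptive update from moment estimates."""
    m = b1 * m + (1 - b1) * g       # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * g**2    # second moment: running mean of squared gradients
    m_hat = m / (1 - b1**t)         # bias correction for the zero initialization
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """AdamW: Adam plus weight decay applied to w directly,
    decoupled from the gradient-based update."""
    w = w - lr * wd * w             # decoupled decay, untouched by the adaptive scaling
    return adam_step(w, g, m, v, t, lr, b1, b2, eps)
```

Note how in adamw_step the decay shrinks every weight by the same fraction, while in classic Adam-with-L2 the decay term would pass through the sqrt(v_hat) denominator and get rescaled per parameter.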

Why it matters

AdamW is the default for training transformers. SGD with momentum is still better for many CV models. The optimizer interacts strongly with learning rate schedule, weight decay, and batch size — changing one without thinking about the others is how you ruin a training run. When a loss curve looks wrong, suspect the optimizer settings before the architecture.

Example

import torch

# assumes `model` is an already-constructed nn.Module
optim = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
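In context, the optimizer is one line of the standard PyTorch training loop. A minimal runnable sketch, where the tiny linear model and random data are placeholders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and data, just to give the optimizer something to chew on.
model = nn.Linear(4, 1)
optim = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

losses = []
for step in range(10):
    optim.zero_grad()                          # clear gradients from the last step
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                            # backprop fills in .grad
    optim.step()                               # optimizer turns .grad into a weight update
    losses.append(loss.item())
```

The optimizer never sees the loss or the model structure; it only reads each parameter's .grad and its own per-parameter state.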

You’ll hear it when

  • Starting any new training run.
  • Reading a training recipe from a paper or model card.
  • Debugging a loss that won’t come down.
  • Comparing AdamW, Lion, and Shampoo on a benchmark.
