Learning Rate

Deep Learning

The scalar that controls how big a step the optimizer takes in the direction of the gradient — the single most important training hyperparameter.


In one line

How big a step to take down the loss surface at each optimizer update.

What it actually means

In plain SGD: w ← w - lr * grad. Too small and training crawls; too big and the loss explodes or oscillates. In adaptive optimizers like Adam, the effective per-parameter step is scaled by running gradient statistics, but the base learning rate still dominates behavior. Modern training uses schedules: linear warmup from 0 for the first few hundred steps, then cosine decay to a small final value. Transformer pretraining usually hovers around 1e-4 to 6e-4; fine-tuning tends to be 10x–100x smaller.
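The warmup-then-cosine schedule described above can be written as a pure function of the step index. This is a minimal sketch; the warmup length, total steps, and LR values below are illustrative, not a recommendation:

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=500,
               total_steps=10_000, final_lr=3e-5):
    # Linear warmup: ramp from 0 to base_lr over the first warmup_steps.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Cosine decay: anneal from base_lr down to final_lr over the rest.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

At step 0 the LR is 0, at the end of warmup it peaks at base_lr, and by total_steps it has annealed to final_lr.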

Why it matters

If you only tune one hyperparameter, tune this. A bad learning rate will mask everything else — you’ll conclude your architecture or data is bad when it’s actually the step size. The standard sanity check is a learning-rate range test: sweep LR on a log scale and watch where loss drops fastest.
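The range test can be illustrated on a toy problem. This sketch runs a few plain-SGD steps on a 1-D quadratic loss L(w) = w² for each candidate LR on a log scale; a real range test would use your actual model and data:

```python
def lr_range_test(lrs, steps=20):
    # For each candidate LR, run `steps` SGD updates on L(w) = w**2
    # starting from w = 1, and record the final loss.
    results = {}
    for lr in lrs:
        w = 1.0
        for _ in range(steps):
            grad = 2 * w       # dL/dw
            w -= lr * grad     # SGD update: w <- w - lr * grad
        results[lr] = w * w    # final loss
    return results

losses = lr_range_test([1e-3, 1e-2, 1e-1, 1.5])  # log-scale sweep plus one too-big LR
```

On this toy loss, larger LRs converge faster up to a point, and past the stability threshold (here, LR = 1.5 makes each step overshoot) the loss blows up — exactly the pattern the range test looks for.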

Example

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

optim = torch.optim.AdamW(model.parameters(), lr=3e-4)  # model: your nn.Module
sched = CosineAnnealingLR(optim, T_max=num_steps)  # num_steps: total optimizer updates
# call sched.step() after each optim.step() to advance the schedule

You’ll hear it when

  • Debugging training that diverges or stalls.
  • Reading any training recipe — it’s always the headline number.
  • Comparing constant, cosine, and linear schedules.
  • Fine-tuning: “use a smaller LR than pretraining”.

Related terms