Learning Rate
The scalar that controls how big a step the optimizer takes in the direction of the gradient — the single most important training hyperparameter.
In one line
How big a step to take down the loss surface at each optimizer update.
What it actually means
In plain SGD: w ← w - lr * grad. Too small and training crawls; too big and the loss explodes or oscillates. In adaptive optimizers like Adam, the effective per-parameter step is scaled by running gradient statistics, but the base learning rate still dominates behavior. Modern training uses schedules: linear warmup from 0 for the first few hundred steps, then cosine decay to a small final value. Transformer pretraining usually hovers around 1e-4 to 6e-4; fine-tuning tends to be 10x–100x smaller.
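The warmup-then-cosine schedule described above can be sketched with PyTorch's `LambdaLR`, which scales the base LR by a per-step multiplier. The `warmup_steps` and `total_steps` values are illustrative, and the model is a stand-in:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 10)  # stand-in for a real model
optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 500, 10_000  # illustrative values

def lr_lambda(step):
    # Linear warmup from 0 to the base LR, then cosine decay toward 0
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = LambdaLR(optim, lr_lambda)  # call sched.step() after each optim.step()
```

At step 0 the effective LR is 0, at step 500 it reaches the full 3e-4, and by step 10,000 it has decayed back to (nearly) 0.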
Why it matters
If you only tune one hyperparameter, tune this. A bad learning rate will mask everything else — you’ll conclude your architecture or data is bad when it’s actually the step size. The standard sanity check is a learning-rate range test: sweep LR on a log scale and watch where loss drops fastest.
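One way to sketch such a range test: sweep the LR geometrically from 1e-6 to 1e-1 over 100 steps, record the loss at each step, and pick the LR where the loss dropped fastest. The toy regression problem and the sweep bounds here are illustrative, not part of the original recipe:

```python
import torch

# Toy regression problem; stands in for a real model and dataloader
torch.manual_seed(0)
X, y = torch.randn(256, 8), torch.randn(256, 1)
model = torch.nn.Linear(8, 1)
optim = torch.optim.SGD(model.parameters(), lr=1e-6)

lo, hi, steps = 1e-6, 1e-1, 100
gamma = (hi / lo) ** (1 / (steps - 1))  # multiplicative LR growth per step

lrs, losses = [], []
for step in range(steps):
    lr = lo * gamma ** step
    for group in optim.param_groups:
        group["lr"] = lr
    loss = torch.nn.functional.mse_loss(model(X), y)
    optim.zero_grad()
    loss.backward()
    optim.step()
    lrs.append(lr)
    losses.append(loss.item())

# Candidate LR: where the step-to-step loss drop was steepest
best = lrs[min(range(1, steps), key=lambda i: losses[i] - losses[i - 1])]
```

In practice you would plot `losses` against `lrs` on a log x-axis and eyeball the steepest descent region rather than trust a single argmin.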
Example
```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

optim = torch.optim.AdamW(model.parameters(), lr=3e-4)  # model defined elsewhere
sched = CosineAnnealingLR(optim, T_max=num_steps)  # cosine decay over num_steps updates
# Call sched.step() once after each optim.step()
```
You’ll hear it when
- Debugging training that diverges or stalls.
- Reading any training recipe — it’s always the headline number.
- Comparing constant, cosine, and linear schedules.
- Fine-tuning: “use a smaller LR than pretraining”.