Adam: A Method for Stochastic Optimization
Kingma & Ba, 2014
What it says
Adam maintains exponential moving averages of the gradient (first moment) and its elementwise square (second moment). At each step, it divides the first-moment estimate by the square root of the second-moment estimate, giving each parameter its own effective learning rate. Bias-correction terms account for the zero initialization of the moving averages. The result is an optimizer that works well out of the box across a wide range of tasks with very little tuning.
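The update described above can be sketched in a few lines. This is a minimal NumPy illustration of the algorithm, not the paper's reference implementation; the function name `adam_step` and the looping convention (the timestep `t` starting at 1) are my own, while the default hyperparameters follow the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Defaults follow the paper."""
    m = beta1 * m + (1 - beta1) * grad       # EMA of the gradient (first moment)
    v = beta2 * v + (1 - beta2) * grad**2    # EMA of its square (second moment)
    m_hat = m / (1 - beta1**t)               # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    # Per-parameter step: first moment scaled by sqrt of second moment.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2 from x = 5.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
```

Note how early steps have magnitude roughly `lr` regardless of the gradient's scale, since `m_hat / sqrt(v_hat)` is close to ±1; this scale-invariance is a large part of why Adam needs so little tuning.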
Why it matters
Adam, and later AdamW (which fixes how weight decay is applied), is the default optimizer for almost all transformer training. It’s forgiving, fast to converge in wall-clock time, and removes a lot of the optimizer-tuning drudgery. For a paper that’s just an optimizer, it’s arguably one of the most cited in machine learning.
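The AdamW fix mentioned above is small but consequential: instead of folding L2 regularization into the gradient (where the adaptive denominator rescales it), the decay is applied directly to the weights. A hedged sketch, with the function name `adamw_step` and the decay coefficient `wd` chosen here for illustration:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    # Moment updates use the raw gradient, with no weight decay folded in.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Decoupled decay: subtracted from the weights directly, so it is not
    # divided by sqrt(v_hat) the way an L2 gradient term would be.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

With a zero gradient, one step shrinks the weights by exactly the factor `1 - lr * wd`, which is the uniform decay Loshchilov & Hutter argue for.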
Read next
- Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2017) — AdamW.
- On the Convergence of Adam and Beyond (Reddi et al., 2018) — the convergence issues and AMSGrad fix.
- Symbolic Discovery of Optimization Algorithms (Chen et al., 2023) — the Lion optimizer found by search.