arXiv 2014 · ICLR 2015

Adam: A Method for Stochastic Optimization

Kingma & Ba

TL;DR
An adaptive optimizer that tracks first and second moments of the gradient per parameter. Became the default optimizer for deep learning almost overnight.

What it says

Adam maintains exponential moving averages of the gradient (first moment) and its square (second moment). At each step, it normalizes the update by the square root of the second moment, giving each parameter its own effective learning rate. Bias-correction terms account for the zero initialization of the moving averages. The result is an optimizer that works well out of the box across a wide range of tasks with very little tuning.
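The update rule described above can be sketched in a few lines of NumPy. The hyperparameter defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the ones the paper recommends; the larger learning rate and the toy quadratic objective here are just for a quick illustration, not part of the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are the running first/second moments; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of the gradient (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of the squared gradient (second moment)
    m_hat = m / (1 - beta1 ** t)              # bias correction: the EMAs start at zero
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter effective step: normalize by the square root of the second moment.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy demo: minimize f(theta) = theta^2 starting from theta = 5.
theta = np.array(5.0)
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

Note how the early steps have magnitude close to `lr` regardless of the raw gradient scale: the `m_hat / sqrt(v_hat)` ratio is roughly ±1 when gradients are consistent, which is what makes the step size so insensitive to gradient scaling.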

Why it matters

Adam, and later AdamW (which fixes how weight decay is applied), is the default optimizer for almost all transformer training. It’s forgiving, fast to converge in wall-clock time, and removes a lot of the optimizer-tuning drudgery. For a paper that’s just an optimizer, it’s arguably one of the most cited in machine learning.
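The AdamW fix mentioned above is small but consequential: plain L2 regularization adds `weight_decay * theta` to the gradient, where Adam's `1/sqrt(v_hat)` term then rescales it per parameter, whereas AdamW shrinks the weights directly. A minimal sketch of the decoupled version (same moment updates as Adam; hyperparameter values here are illustrative):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: weight decay is decoupled from the adaptive step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # L2-style decay would fold weight_decay*theta into grad, letting 1/sqrt(v_hat)
    # rescale it per parameter; AdamW applies it directly to the weights instead.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v

# With a zero gradient, only the decay acts: theta shrinks by (1 - lr*weight_decay) per step.
theta = np.array(1.0)
m = v = np.zeros_like(theta)
for t in range(1, 11):
    theta, m, v = adamw_step(theta, np.array(0.0), m, v, t)
```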

Follow-ups

  • Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2017) — AdamW.
  • On the Convergence of Adam and Beyond (Reddi et al., 2018) — the convergence issues and the AMSGrad fix.
  • Symbolic Discovery of Optimization Algorithms (Chen et al., 2023) — the Lion optimizer found by program search.