Gradient Descent
In one line
An iterative optimization method that nudges parameters in the opposite direction of the gradient to reduce a loss.
What it actually means
You start from some initial weights, compute the loss on a batch of data, use backprop to get the gradient of that loss, then take a step in the direction that lowers it. The size of that step is the learning rate. Vanilla SGD uses the raw gradient from one batch at a time. Modern variants adapt the step: Adam and AdamW keep moving averages of the gradient and its square so the effective step size adapts per parameter, while Lion keeps only a momentum estimate and steps by the sign of the update. The whole training loop is just this update repeated millions of times.
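A minimal sketch of that loop in NumPy, fitting a single weight on toy data (the data, learning rate, and batch size here are hypothetical choices, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn w for y = 3x under squared-error loss.
x = rng.normal(size=100)
y = 3.0 * x

w = 0.0           # initial weight
lr = 0.1          # learning rate
batch_size = 10

for step in range(200):
    # Sample a mini-batch.
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    # Gradient of mean squared error w.r.t. w, computed analytically
    # (a framework would get this via backprop instead).
    grad = np.mean(2.0 * (w * xb - yb) * xb)
    # The SGD update: step against the gradient.
    w -= lr * grad

print(w)  # converges toward 3.0
```

Swapping in Adam or another optimizer changes only how `grad` is turned into a step; the structure of the loop stays the same.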
Why it matters
Gradient descent is the engine of essentially every neural network you’ll train. Choosing the optimizer, learning rate schedule, and batch size are some of the highest-leverage decisions you make, and most “training won’t converge” problems trace back to one of those three.
Example
w_{t+1} = w_t - lr * ∂L/∂w_t
With lr = 0.01 and a gradient of 2.0, the weight moves by -0.01 × 2.0 = -0.02 each step.
You’ll hear it when
- Tuning a learning rate or learning rate schedule.
- Picking AdamW vs SGD with momentum for a new training run.
- Diagnosing loss spikes or NaN losses.
- Reading about second-order methods or natural gradients.
- Setting up a warmup or cosine decay schedule.