Calculus for ML

The calculus you need to understand what backprop is actually doing. Derivatives, gradients, and the chain rule — the rest you can Google.

Mathematics beginner #math #calculus #gradients #backprop
Prereqs: High school algebra

What you actually need

Forget limits and epsilon-delta proofs. For ML you need three things:

  1. Derivatives of single-variable functions — the slope at a point. d/dx (x²) = 2x.
  2. Partial derivatives — the slope when you wiggle one variable while holding others fixed.
  3. The chain rule — how derivatives compose through nested functions. This is literally backpropagation.
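Item 2 is easy to see numerically: wiggle one variable while holding the other fixed. A sketch with a made-up function f(x, y) = x²y, where the partials are ∂f/∂x = 2xy and ∂f/∂y = x² (the function and the point (2, 3) are illustrative choices, not from the text):

```python
# Partial derivatives via central differences: wiggle one variable at a time.
# f(x, y) = x**2 * y, so df/dx = 2*x*y and df/dy = x**2.

def f(x, y):
    return x ** 2 * y

h = 1e-6
x0, y0 = 2.0, 3.0

df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)  # expect 2*2*3 = 12
df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)  # expect 2**2  = 4

print(round(df_dx, 3), round(df_dy, 3))  # 12.0 4.0
```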

The gradient

The gradient of a function is just the vector of all its partial derivatives. If you have a loss function L(w₁, w₂, w₃), the gradient is [∂L/∂w₁, ∂L/∂w₂, ∂L/∂w₃]. It points in the direction of steepest ascent.

Gradient descent says: to minimize L, step in the direction opposite the gradient: w ← w − η ∇L(w), where the learning rate η sets the step size.
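The whole loop fits in a few lines. A sketch on a toy loss L(w₁, w₂) = (w₁ − 2)² + (w₂ + 1)², whose minimum is at (2, −1); the loss, starting point, and learning rate are all illustrative choices:

```python
# Gradient descent on L(w1, w2) = (w1 - 2)**2 + (w2 + 1)**2.
# Gradient: [2*(w1 - 2), 2*(w2 + 1)]; the minimum is at (2, -1).

def grad(w1, w2):
    return 2 * (w1 - 2), 2 * (w2 + 1)

w1, w2 = 0.0, 0.0
lr = 0.1  # learning rate = step size
for _ in range(100):
    g1, g2 = grad(w1, w2)
    w1 -= lr * g1  # step opposite the gradient
    w2 -= lr * g2

print(round(w1, 3), round(w2, 3))  # 2.0 -1.0
```

Every deep learning optimizer is a variation on this loop; the gradient just has billions of components instead of two.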

Why the chain rule is everything

When you have a neural network, you have nested functions: loss(softmax(linear(relu(linear(x))))). To compute the gradient of the loss with respect to early-layer weights, you have to chain derivatives backward through every layer. That’s backpropagation.
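A backward pass written by hand makes the chaining concrete. A sketch through a one-neuron "network" loss(relu(w·x + b)); all the numbers (x=2, w=0.5, b=1, target=3) are made up for illustration:

```python
# One backward pass through loss(relu(w*x + b)), chaining derivatives by hand.

x, w, b, target = 2.0, 0.5, 1.0, 3.0

# Forward pass, saving intermediates.
z = w * x + b             # linear: z = 2.0
a = max(z, 0.0)           # relu:   a = 2.0
loss = (a - target) ** 2  # squared error: loss = 1.0

# Backward pass: one chain-rule factor per layer, multiplied right to left.
dloss_da = 2 * (a - target)    # d(loss)/da = -2
da_dz = 1.0 if z > 0 else 0.0  # relu derivative: 1 where active, else 0
dz_dw = x                      # d(w*x + b)/dw = x

dloss_dw = dloss_da * da_dz * dz_dw  # -2 * 1 * 2 = -4
print(dloss_dw)  # -4.0
```

Autograd frameworks do exactly this, recording the forward intermediates and replaying the chain rule in reverse.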

PyTorch and JAX handle this automatically via autograd, but when something breaks — vanishing gradients, exploding gradients, a detach() you didn’t mean to call — understanding what’s happening under the hood is the difference between a 10-minute fix and a week of confused print-debugging.
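Vanishing gradients in particular fall straight out of the chain rule: one derivative factor per layer, multiplied together, so if every factor is small the product collapses geometrically. A sketch using the sigmoid's maximum slope of 1/4 (the depth of 20 is an arbitrary illustration):

```python
# Why gradients vanish: the chain rule multiplies one derivative per layer.
# sigmoid'(x) = s(x) * (1 - s(x)) <= 1/4, so a 20-layer chain of sigmoids
# can shrink the gradient by at least a factor of 4**20.

max_sigmoid_slope = 0.25
depth = 20

gradient_factor = max_sigmoid_slope ** depth
print(gradient_factor)  # ~9.1e-13: the early layers barely learn
```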

One concrete walk-through

Take y = (3x + 1)². Two ways to differentiate:

  • Expand: y = 9x² + 6x + 1, so dy/dx = 18x + 6.
  • Chain rule: Let u = 3x + 1, so y = u². Then dy/du = 2u and du/dx = 3, so dy/dx = 2u · 3 = 6u = 18x + 6. ✓
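The two answers above can also be checked numerically with a finite difference — the same trick works when you don't trust a hand derivation (the step size h and test points are arbitrary choices):

```python
# Numerical check of the worked example: dy/dx of (3x + 1)**2 is 18x + 6.

def numerical_derivative(f, x, h=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

y = lambda x: (3 * x + 1) ** 2

for x in (0.0, 1.0, 2.5):
    approx = numerical_derivative(y, x)
    exact = 18 * x + 6
    print(round(approx, 3), exact)  # the two columns should match
```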

Backprop is doing exactly this, one layer at a time, across billions of parameters.

What to skip

Integrals beyond the basics. You almost never integrate by hand in ML. If you need an integral, you either use a library or sample (Monte Carlo).
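Sampling really is that simple. A Monte Carlo sketch estimating ∫₀¹ x² dx = 1/3 by averaging random samples (the integrand and sample count are illustrative choices):

```python
# Monte Carlo integration: the average of f(x) over uniform samples on [0, 1]
# converges to the integral of f over [0, 1].
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
estimate = sum(random.random() ** 2 for _ in range(n)) / n
print(estimate)  # close to 1/3
```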