Calculus for ML
The calculus you need to understand what backprop is actually doing. Derivatives, gradients, and the chain rule — the rest you can Google.
What you actually need
Forget limits and epsilon-delta proofs. For ML you need three things:
- Derivatives of single-variable functions — the slope at a point. d/dx (x²) = 2x.
- Partial derivatives — the slope when you wiggle one variable while holding others fixed.
- The chain rule — how derivatives compose through nested functions. This is literally backpropagation.
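All three ideas can be sanity-checked numerically with finite differences. A minimal sketch in plain Python (the helper names `derivative` and `partial` are mine, not from any library):

```python
def derivative(f, x, h=1e-6):
    """Numerical derivative of a single-variable function: the slope at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def partial(f, args, i, h=1e-6):
    """Partial derivative of f with respect to its i-th argument,
    wiggling that one while holding the others fixed."""
    hi, lo = list(args), list(args)
    hi[i] += h
    lo[i] -= h
    return (f(*hi) - f(*lo)) / (2 * h)

# d/dx (x^2) = 2x, so at x = 3 the slope should be 6
print(derivative(lambda x: x**2, 3.0))                 # ~6.0

# f(x, y) = x^2 * y: df/dx = 2xy, df/dy = x^2
print(partial(lambda x, y: x**2 * y, (2.0, 5.0), 0))   # ~20.0
print(partial(lambda x, y: x**2 * y, (2.0, 5.0), 1))   # ~4.0
```

Numerical checks like this are also how you debug a hand-written gradient: compare it against finite differences at a few random points.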
The gradient
The gradient of a function is just the vector of all its partial derivatives. If you have a loss function L(w₁, w₂, w₃), the gradient is [∂L/∂w₁, ∂L/∂w₂, ∂L/∂w₃]. It points in the direction of steepest ascent.
Gradient descent says: to minimize L, step in the direction opposite the gradient. The step size is the learning rate.
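Here is a minimal sketch of both ideas together, using a numerical gradient and a toy quadratic loss (the loss and its minimum at [1, 2, 3] are made up for illustration):

```python
def gradient(L, w, h=1e-6):
    """Numerical gradient: the vector of partial derivatives of L at w."""
    g = []
    for i in range(len(w)):
        hi, lo = list(w), list(w)
        hi[i] += h
        lo[i] -= h
        g.append((L(hi) - L(lo)) / (2 * h))
    return g

def loss(w):
    # Toy loss L(w1, w2, w3) with its minimum at w = [1, 2, 3]
    return (w[0] - 1)**2 + (w[1] - 2)**2 + (w[2] - 3)**2

w = [0.0, 0.0, 0.0]
lr = 0.1  # learning rate: the step size
for _ in range(100):
    g = gradient(loss, w)
    # Step in the direction OPPOSITE the gradient
    w = [wi - lr * gi for wi, gi in zip(w, g)]

print(w)  # close to [1.0, 2.0, 3.0]
```

Real optimizers compute the gradient analytically via autograd rather than by finite differences, but the update rule is exactly this line: `w = w - lr * g`.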
Why the chain rule is everything
When you have a neural network, you have nested functions: loss(softmax(linear(relu(linear(x))))). To compute the gradient of the loss with respect to early-layer weights, you have to chain derivatives backward through every layer. That’s backpropagation.
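A hand-rolled sketch of that backward chaining on a toy two-layer "network" (the architecture and numbers are invented for illustration; real backprop does the same thing with matrices):

```python
# Toy network: loss = (w2 * relu(w1 * x))^2
# Forward pass saves intermediates; backward pass chains derivatives
# from the loss back to the early-layer weight w1.
def forward_backward(x, w1, w2):
    a = w1 * x                      # first linear layer
    r = max(a, 0.0)                 # relu
    b = w2 * r                      # second linear layer
    loss = b**2

    # Backward pass: chain rule, outermost function first
    dloss_db = 2 * b
    db_dw2 = r
    db_dr = w2
    dr_da = 1.0 if a > 0 else 0.0   # relu derivative
    da_dw1 = x

    dloss_dw2 = dloss_db * db_dw2
    dloss_dw1 = dloss_db * db_dr * dr_da * da_dw1  # chained all the way back
    return loss, dloss_dw1, dloss_dw2

loss, g1, g2 = forward_backward(x=1.5, w1=2.0, w2=-0.5)
print(loss, g1, g2)  # 2.25, 2.25, -9.0
```

Notice that the gradient for the early weight `w1` is a product of one local derivative per layer. That product structure is the whole story of backprop, and of its failure modes.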
PyTorch and JAX handle this automatically via autograd, but when something breaks — vanishing gradients, exploding gradients, a detach you didn’t intend — understanding what’s happening under the hood is the difference between a 10-minute fix and a week of confused printing.
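Vanishing gradients fall straight out of the chain rule: the gradient through n stacked layers is a product of n local derivatives, and if each factor is small, the product collapses. A self-contained illustration with stacked sigmoids (whose derivative is at most 0.25):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stacked_sigmoid_grad(x, n):
    """Gradient of n nested sigmoids at x: a product of n local
    derivatives s*(1-s), each at most 0.25."""
    grad = 1.0
    for _ in range(n):
        s = sigmoid(x)
        grad *= s * (1 - s)  # chain in this layer's local derivative
        x = s                # feed the activation forward
    return grad

print(stacked_sigmoid_grad(0.5, 2))
print(stacked_sigmoid_grad(0.5, 20))  # many orders of magnitude smaller
```

This is why deep nets moved to relu-family activations and added residual connections: they keep the per-layer factors from shrinking the product to numerical dust.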
One concrete walk-through
Take y = (3x + 1)². Two ways to differentiate:
- Expand: y = 9x² + 6x + 1, so dy/dx = 18x + 6.
- Chain rule: Let u = 3x + 1, so y = u². Then dy/du = 2u and du/dx = 3, so dy/dx = 2u · 3 = 6u = 18x + 6. ✓
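You can confirm the hand derivation numerically. A quick check of dy/dx = 18x + 6 against a finite-difference slope at a few points:

```python
def y(x):
    return (3 * x + 1)**2

def numeric_dy_dx(x, h=1e-6):
    # Central-difference approximation to the slope of y at x
    return (y(x + h) - y(x - h)) / (2 * h)

for x in (0.0, 1.0, -2.5):
    analytic = 18 * x + 6  # the chain-rule result from above
    print(x, numeric_dy_dx(x), analytic)
```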
Backprop is doing exactly this, one layer at a time, across billions of parameters.
What to skip
Integrals beyond the basics. You almost never integrate by hand in ML. If you need an integral, you either use a library or sample (Monte Carlo).
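Sampling an integral is three lines. A sketch of a Monte Carlo estimate of ∫₀¹ x² dx = 1/3, averaging the integrand at uniform random points:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
# E[x^2] for x ~ Uniform(0, 1) equals the integral of x^2 over [0, 1]
estimate = sum(random.random()**2 for _ in range(n)) / n
print(estimate)  # close to 1/3
```

The same trick scales to high dimensions where grid-based integration is hopeless, which is exactly why ML leans on sampling instead of hand integration.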