Activation Function
ReLU · GELU · Sigmoid
In one line
A non-linear function applied element-wise after a linear layer so the whole network can learn something more interesting than a single matrix multiply.
What it actually means
Without activations, stacking layers does nothing: the composition of linear maps is still a linear map. An activation function like ReLU (max(0, x)), GELU (a smoother ReLU used in transformers), or tanh bends the output so the network can carve curved decision boundaries. The choice matters: ReLU is fast and well-behaved, but a unit can "die" (get stuck outputting zero for every input, so no gradient flows through it); GELU is the de facto choice inside transformer FFN blocks; sigmoid and tanh saturate and cause vanishing gradients in deep stacks; and softmax turns a vector into a probability distribution at the output head.
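To make the element-wise behavior concrete, here is a minimal sketch (the input values are arbitrary, chosen only to show each function's shape):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(F.relu(x))            # negatives clipped to exactly 0
print(F.gelu(x))            # smooth near zero, approaches ReLU for large |x|
print(torch.tanh(x))        # squashed into (-1, 1); flattens out (saturates) for large |x|
print(F.softmax(x, dim=0))  # non-negative and sums to 1: a probability distribution
```

Note that ReLU, GELU, and tanh map each element independently, while softmax couples the whole vector — which is why it lives at the output head rather than between hidden layers.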
Why it matters
The activation you pick affects training stability, gradient flow, and how fast your model trains. If you plug a sigmoid into the middle of a 50-layer ResNet you will watch the gradient vanish in real time. Most modern architectures default to ReLU / GELU / SiLU for hidden layers because they don’t saturate on the positive side.
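You can watch this happen in a toy setting. The sketch below chains 50 bare sigmoids (a stand-in for the 50-layer example above; real ResNets have weights and skip connections, omitted here): each layer multiplies the backward gradient by sigmoid'(z) ≤ 0.25, so the product collapses toward zero.

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
h = x
for _ in range(50):
    h = torch.sigmoid(h)  # each step scales the gradient by sigmoid'(z) <= 0.25

h.backward()
print(x.grad)  # astronomically small after 50 saturating layers
```

Swapping `torch.sigmoid` for `torch.relu` in the loop leaves the gradient at exactly 1, since ReLU's derivative is 1 on the positive side.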
Example
import torch.nn.functional as F
# Inside a transformer FFN: expand, apply GELU, project back down
x = F.gelu(x @ W1 + b1)  # W1: (d_model, d_ff) — the expansion
x = x @ W2 + b2          # W2: (d_ff, d_model) — the projection
You’ll hear it when
- Reading any architecture description (“two-layer MLP with GELU”).
- Debugging a model that won’t train — “are your activations saturating?”
- Comparing ReLU vs GELU vs SwiGLU in LLM papers.
- Initializing weights — He init for ReLU, Xavier for tanh.