Transformer
In one line
A neural network architecture built on stacked self-attention and feed-forward layers — the backbone of every modern LLM.
What it actually means
A transformer block has two sub-layers: multi-head self-attention and a position-wise MLP, each wrapped in residual connections and layer normalisation. You stack a bunch of these blocks (12 for small models, 80+ for frontier ones), feed in token embeddings plus positional information, and train end-to-end with backprop. Decoder-only variants (GPT, Llama, Claude) use causal masking so each token can only attend to earlier ones. Encoder-only variants (BERT) attend in both directions and are used for classification and embeddings.
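A single block as described above can be sketched in plain numpy. This is a minimal illustration, not a production implementation: it uses the pre-norm wiring common in GPT-style decoder-only models, a ReLU MLP, and no biases or dropout, and every parameter name here is invented for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's feature vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # x: (seq_len, d_model), split across n_heads attention heads.
    T, d = x.shape
    dh = d // n_heads
    q = (x @ Wq).reshape(T, n_heads, dh).transpose(1, 0, 2)   # (heads, T, dh)
    k = (x @ Wk).reshape(T, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_heads, dh).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)           # (heads, T, T)
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)         # causal: no attending ahead
    scores = np.where(mask, -1e9, scores)
    out = softmax(scores) @ v                                 # (heads, T, dh)
    return out.transpose(1, 0, 2).reshape(T, d) @ Wo          # merge heads, project

def transformer_block(x, params, n_heads=4):
    # Two sub-layers, each with a residual connection around it (pre-norm).
    x = x + causal_self_attention(layer_norm(x), *params["attn"], n_heads)
    return x + np.maximum(0, layer_norm(x) @ params["W1"]) @ params["W2"]  # position-wise MLP
```

A quick way to see the causal mask doing its job: perturb the last token and check that earlier positions' outputs are unchanged.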
Why it matters
Transformers scale. They train faster than RNNs because every position can be processed in parallel, they handle long-range dependencies through attention rather than a recurrent state, and the compute-quality curve has held up across six orders of magnitude of parameters. If you’re building anything language-shaped in 2026, you’re using a transformer or a hybrid that started as one.
Example
input ids → embed + positional → [ attention → MLP ] x N → unembed → logits
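The diagram above can be turned into an end-to-end sketch with toy sizes. To keep it short this version uses a single attention head per block, tied embedding/unembedding weights, and learned positional vectors; all names and dimensions are illustrative assumptions, not any particular model's.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def block(x, W):
    # One simplified block: single-head causal attention, then a 2-layer MLP,
    # each added back via a residual connection.
    T, d = x.shape
    q, k, v = x @ W["q"], x @ W["k"], x @ W["v"]
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(T, 1)] = -1e9              # causal mask
    x = x + softmax(scores) @ v
    return x + np.maximum(0, x @ W["up"]) @ W["down"]

def forward(ids, E, P, blocks):
    x = E[ids] + P[:len(ids)]                         # embed + positional
    for W in blocks:                                  # [ attention → MLP ] x N
        x = block(x, W)
    return x @ E.T                                    # unembed (tied weights) → logits

vocab, d, T, N = 50, 8, 6, 2
rng = np.random.default_rng(1)
E = rng.normal(0, 0.1, (vocab, d))                    # token embedding table
P = rng.normal(0, 0.1, (T, d))                        # learned positional vectors
blocks = [{"q": rng.normal(0, 0.1, (d, d)), "k": rng.normal(0, 0.1, (d, d)),
           "v": rng.normal(0, 0.1, (d, d)), "up": rng.normal(0, 0.1, (d, 4 * d)),
           "down": rng.normal(0, 0.1, (4 * d, d))} for _ in range(N)]

logits = forward(np.array([3, 14, 15, 9, 2, 6]), E, P, blocks)
print(logits.shape)   # one row of next-token logits per input position
```

With random weights the logits are meaningless, of course; training adjusts `E`, `P`, and the block weights end-to-end so that each position's logits predict the next token.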
You’ll hear it when
- Reading any model card or architecture diagram.
- Comparing transformer-based LLMs to newer state-space models like Mamba.
- Discussing why GPUs and TPUs became the default hardware for ML.
- Picking between encoder-only, decoder-only, and encoder-decoder for a task.
- Talking about scaling laws.