Attention
In one line
A mechanism that lets a model decide which other tokens in a sequence matter most when building a representation for the current one.
What it actually means
Attention takes three projections of your inputs — queries, keys, and values — and uses the dot product of queries and keys to score how relevant each position is to every other position. Those scores get softmaxed into weights, and the output for a position is a weighted sum of the value vectors. In self-attention, all three come from the same sequence, so every token can pull information from every other token in one step. Multi-head attention runs several of these in parallel with different projections so the model can look for different kinds of relationships at once.
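The multi-head self-attention described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not any particular library's API; the weight matrices and function names are assumptions for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Self-attention: Q, K, V are all projections of the same sequence X.

    X: (T, d_model) token representations; W*: (d_model, d_model) projections.
    """
    T, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then split the feature dim into heads: (n_heads, T, d_head).
    Q = (X @ Wq).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    # Each head scores every position against every other: (n_heads, T, T).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ V                                   # (n_heads, T, d_head)
    # Concatenate heads back to (T, d_model) and mix with the output projection.
    return heads.transpose(1, 0, 2).reshape(T, d_model) @ Wo
```

Each head sees a different learned projection of the same tokens, which is what lets the model track several kinds of relationships in parallel.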
Why it matters
Before attention, sequence models had to walk through tokens one at a time and hope earlier information survived the walk. Attention replaced that with a direct global lookup — any token can reach any other in a single step — which is what made transformers trainable on huge datasets and what makes long-context LLMs feasible at all. If you understand attention, you understand the core operation inside every modern LLM, vision transformer, and multimodal model.
Example
softmax(Q @ K.T / sqrt(d_k)) @ V
For three tokens with d_k = 4, queries and keys produce a 3x3 matrix of scores. After softmax each row sums to 1 — those are the attention weights for that token over the sequence.
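That worked example is runnable in a few lines of NumPy. The random inputs here are placeholders standing in for learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4
Q = rng.standard_normal((3, d_k))  # queries for 3 tokens
K = rng.standard_normal((3, d_k))  # keys
V = rng.standard_normal((3, d_k))  # values

scores = Q @ K.T / np.sqrt(d_k)    # (3, 3) relevance scores
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ V                  # (3, d_k) weighted sum of values

print(weights.sum(axis=-1))        # each row sums to 1
```

Inspecting `weights` directly like this is also the basic idea behind attention-map visualizations.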
You’ll hear it when
- Reading any transformer paper or LLM architecture diagram.
- Debugging long-context behaviour (“the model isn’t attending to the system prompt”).
- Profiling a model and someone mentions FlashAttention or KV cache.
- Comparing encoder-only vs decoder-only vs cross-attention setups.
- Talking about why compute scales quadratically with context length.