Multi-Head Attention
Running several attention operations in parallel with different learned projections so the model can attend to multiple relationships at once.
In one line
Split queries, keys, and values into h heads, run attention independently in each, then concatenate — so the model can look for multiple kinds of relationships at the same time.
What it actually means
Single-head attention gives each token one attention distribution over the sequence. Multi-head attention projects the inputs into h lower-dimensional subspaces with separate weight matrices, runs scaled dot-product attention independently in each head, and concatenates the results before a final projection. Different heads learn to focus on different things — one might track subject-verb agreement, another might attend to nearby punctuation, another might do coreference. Variants like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) share K/V projections across heads to cut KV-cache memory during inference.
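The project-split-attend-concatenate pipeline above can be sketched directly. This is a minimal illustration, not an optimized implementation; the weight shapes and dimension names are illustrative:

```python
import torch

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    # x: (seq, d_model). Project, then split into h heads of size d_model // h.
    seq, d_model = x.shape
    d_head = d_model // h
    q = (x @ W_q).view(seq, h, d_head).transpose(0, 1)  # (h, seq, d_head)
    k = (x @ W_k).view(seq, h, d_head).transpose(0, 1)
    v = (x @ W_v).view(seq, h, d_head).transpose(0, 1)
    # Scaled dot-product attention runs independently in each head.
    scores = q @ k.transpose(1, 2) / d_head ** 0.5      # (h, seq, seq)
    attn = torch.softmax(scores, dim=-1)
    heads = attn @ v                                    # (h, seq, d_head)
    # Concatenate the heads, then apply the final output projection.
    concat = heads.transpose(0, 1).reshape(seq, d_model)
    return concat @ W_o

d_model, h, seq = 768, 12, 10
W = [torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4)]
x = torch.randn(seq, d_model)
out = multi_head_attention(x, *W, h)
print(out.shape)  # torch.Size([10, 768])
```

Note that the per-head projections are implemented as one big matrix multiply followed by a reshape, which is how production implementations do it as well.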
Why it matters
Multi-head attention is central to why transformers work as well as they do: the expressive power comes from having many parallel attention patterns, not from making one pattern bigger. Every transformer uses it, and GQA/MQA are now the default in open-weight LLMs because they make long-context inference far cheaper.
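The KV-cache saving from sharing K/V heads is easy to quantify. A back-of-the-envelope sketch, where the model dimensions (32 layers, 32 query heads, head dim 128, fp16) are illustrative rather than taken from any specific model:

```python
def kv_cache_bytes(layers, seq_len, n_kv_heads, d_head, bytes_per_param=2):
    # 2 cached tensors (K and V) per layer, each (seq_len, n_kv_heads, d_head).
    return 2 * layers * seq_len * n_kv_heads * d_head * bytes_per_param

# Illustrative 32-layer model at 8k context, fp16 (2 bytes per value).
mha = kv_cache_bytes(layers=32, seq_len=8192, n_kv_heads=32, d_head=128)  # full MHA
gqa = kv_cache_bytes(layers=32, seq_len=8192, n_kv_heads=8,  d_head=128)  # 4 query heads per KV head
mqa = kv_cache_bytes(layers=32, seq_len=8192, n_kv_heads=1,  d_head=128)  # one shared KV head
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, MQA: {mqa / 2**30:.3f} GiB")
# MHA: 4.0 GiB, GQA: 1.0 GiB, MQA: 0.125 GiB
```

Per sequence, the cache shrinks linearly with the number of KV heads, which is exactly the lever GQA and MQA pull.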
Example
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
x = torch.randn(2, 10, 768)   # (batch, seq_len, embed_dim); 12 heads of dim 64
out, weights = mha(x, x, x)   # self-attention: q = k = v
You’ll hear it when
- Reading any transformer architecture diagram.
- Discussing KV cache memory in LLM serving.
- Comparing MHA, MQA, and GQA in open-weights models.
- Interpreting attention patterns for model analysis.