Transformers & Attention

The architecture behind every modern LLM. Attention, Q/K/V, multi-head, positional encodings, and why it scales.

Deep Learning intermediate #deep-learning #transformers #attention #llms

The one-paragraph version

Transformers replaced recurrence with attention. Instead of reading a sequence token by token (RNNs) or with fixed windows (CNNs), a Transformer looks at all positions at once and lets every token gather information from every other token via attention. That parallelism unlocked training on massive corpora, which is how we got from “clever NLP models” to GPT.

If you already get the idea, the rest of this page is an intuition refresh. If you don’t, start here and come back to it twice.

Attention, intuitively

Imagine you’re translating a sentence. As you produce each output word, you look back at the input sentence and pay more attention to the input words that are most relevant to the current output. That’s it. Attention formalizes “look at the most relevant bits” as a differentiable operation.

At each position, attention asks: given where I am right now, which other positions should I pull information from, and how much from each?

Q, K, V — the three vectors

For each token in the input, the model learns three projections:

  • Query (Q) — “what am I looking for?”
  • Key (K) — “what do I represent?”
  • Value (V) — “what do I actually carry?”

The mechanism is four steps:

  1. Score — for each query, compute its dot product with every key. This gives a similarity between “what I need” and “what each token has”.
  2. Scale — divide by sqrt(d_k) so the dot products don’t grow with dimensionality and saturate the softmax.
  3. Softmax — turn the scores into weights that sum to 1.
  4. Weighted sum — take a weighted average of the value vectors using those weights.

In one formula:

Attention(Q, K, V) = softmax((Q @ K.T) / sqrt(d_k)) @ V

That’s the whole idea. Everything else in a Transformer is this operation repeated, stacked, and wrapped.
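The four steps above fit in a few lines of NumPy. This is a minimal sketch of single-head scaled dot-product attention, not any particular library’s implementation; the shapes in the comments are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # 1. score + 2. scale
    weights = softmax(scores, axis=-1)   # 3. softmax over the keys
    return weights @ V                   # 4. weighted sum of values
```

Each row of `weights` sums to 1, so each output position is a convex combination of the value vectors.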

Multi-head attention

A single attention head can only learn one “kind” of relationship. Real language has many — syntactic dependencies, coreference, topic alignment, punctuation. Multi-head attention runs several attention computations in parallel with different learned Q/K/V projections, then concatenates the outputs and projects them back down.

Think of heads as independent “eyes” on the same sequence. Each head picks up different patterns. Typical head counts are 8, 12, 16, or 32, depending on model size.
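The split-compute-concatenate dance can be sketched like this, assuming a single shared input `X` and one projection matrix per role (real implementations usually fuse these; the function and parameter names here are illustrative, not a real API):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    # X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then split the feature dimension across heads.
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)   # (n_heads, seq_len, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ V                   # attention per head, in parallel
    # Concatenate heads back together and project down to d_model.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo
```

Note that the per-head dimension is `d_model / n_heads`, so adding heads doesn’t add parameters — it partitions the same representation into parallel subspaces.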

Positional encodings

Attention is permutation-invariant: swap two tokens and the output is the same (modulo the swap). That’s bad for language, where order is meaning. So we inject position information into the embeddings before attention runs.

  • Sinusoidal encodings (original paper) — deterministic sine/cosine patterns added to embeddings.
  • Learned absolute — a learnable vector per position.
  • Relative and RoPE (Rotary Position Embeddings) — encode relative distances directly into the Q/K projections. RoPE is what modern LLMs (LLaMA, Mistral) use, because it extrapolates better to longer contexts.
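As a concrete example of the first option, here is a sketch of the sinusoidal encodings from the original paper: each position gets a deterministic vector of sines and cosines at geometrically spaced frequencies, which is simply added to the token embeddings.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Angle rates follow pos / 10000^(2i / d_model), as in the original paper.
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return pe
```

Because nearby positions get similar vectors and the frequencies span many scales, the model can recover both fine-grained and coarse-grained order from these patterns.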

Encoder vs decoder vs encoder-decoder

Three flavors, each with a different job:

  • Encoder-only (BERT, RoBERTa). Full bidirectional attention across the sequence. Good for: classification, span extraction, embeddings. Not generative.
  • Decoder-only (GPT, LLaMA, Claude, Mistral). Causal masked attention — each position can only see positions before it. Trained to predict the next token. Good for: generation, chat, completion. This is the dominant paradigm today.
  • Encoder-decoder (T5, BART, original Transformer). Encoder processes the input bidirectionally; decoder generates output tokens with causal attention to its own output plus cross-attention to the encoder states. Good for: translation, summarization, seq2seq tasks with distinct input and output.
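The “causal masked attention” used by decoder-only models is a one-line change to the attention sketch: set every score above the diagonal to negative infinity, so the softmax assigns zero weight to future positions. A minimal NumPy sketch:

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal = positions a token is NOT allowed to see.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # -inf scores become exactly 0 after the softmax.
    scores = np.where(causal_mask(scores.shape[0]), -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V
```

A quick sanity check: the first token can only attend to itself, so its output is exactly its own value vector.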

Scaling tricks that matter

Plain attention is O(n²) in sequence length. A 32k context is 1024x more attention work than a 1k context. Several tricks fight this:

  • FlashAttention — reworks the kernel so it never materializes the full n × n attention matrix, only streaming blocks. Same math, much less memory, much faster on GPU. Now the default in training and inference stacks.
  • KV caching — at inference, you don’t recompute K and V for tokens you’ve already processed. You cache them. This is why inference time grows roughly linearly in sequence length, not quadratically.
  • Grouped-query attention (GQA) and multi-query attention (MQA) — share K/V projections across groups of heads to shrink the KV cache.
  • Sliding-window / sparse / linear attention — approximate the full attention with cheaper patterns. Used by Mistral, Longformer, Mamba-style hybrids.
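KV caching is the easiest of these to see in code. This toy class (names are illustrative, not from any real inference stack) shows the core idea for a single head: at each decoding step, only the newest token’s K and V are computed and appended, and its query attends over everything cached so far.

```python
import numpy as np

class KVCache:
    # Toy single-head cache: append K/V per step instead of recomputing them.
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        # q, k, v: (d_k,) projections for the newest token only.
        self.K.append(k)
        self.V.append(v)
        K, V = np.stack(self.K), np.stack(self.V)   # (t, d_k) after t steps
        scores = K @ q / np.sqrt(len(q))            # attend over all cached keys
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                                # newest token's output
```

Each step does O(t) work against the cache instead of O(t²) from scratch, which is where the roughly linear inference cost comes from. GQA and MQA shrink this cache further by letting several query heads share one set of cached K/V tensors.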

Where to go next

The original paper is surprisingly readable — only 15 pages, and the math is mostly the scaled dot-product attention formula above.

Attention Is All You Need (Vaswani et al., 2017)

After that, read BERT for encoder-only intuition, GPT-2 for decoder-only, and the FlashAttention paper for the systems side of how modern Transformers actually run.