FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Dao, Fu, Ermon, Rudra, Ré
What it says
Standard attention materializes the full N×N score matrix in GPU HBM, applies softmax, then multiplies by V. On real hardware the bottleneck is that HBM traffic, not the FLOPs. FlashAttention tiles the computation so each block of queries and keys is loaded into fast on-chip SRAM once, with softmax computed online via a running max and running normalizer, so the full score matrix is never written to HBM. The result is exact attention, mathematically identical to the standard computation rather than an approximation (though floating-point rounding can differ slightly due to the changed summation order), with 2–4x higher throughput and memory that scales linearly in sequence length instead of quadratically.
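The online-softmax trick is easy to state in code. Here is a minimal NumPy sketch of the tiled recurrence (single head, no masking, illustrative function and block size of my own choosing): each query tile keeps a running row max `m`, a running softmax denominator `l`, and an unnormalized output accumulator, and rescales all three whenever a new key tile raises the max. The real kernel fuses these loops on the GPU and keeps each tile in SRAM; this only shows the math.

```python
import numpy as np

def tiled_attention_sketch(Q, K, V, block_size=64):
    """Tiled exact attention with online softmax (single head, no mask).

    Educational sketch of the FlashAttention recurrence in NumPy; the
    actual kernel runs these loops fused on-chip, never materializing
    the N x N score matrix in HBM.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    for qs in range(0, N, block_size):
        q = Q[qs:qs + block_size]                    # query tile
        m = np.full(q.shape[0], -np.inf)             # running row max
        l = np.zeros(q.shape[0])                     # running softmax denom
        acc = np.zeros((q.shape[0], d))              # unnormalized output
        for ks in range(0, N, block_size):
            k = K[ks:ks + block_size]
            v = V[ks:ks + block_size]
            s = (q @ k.T) * scale                    # score tile
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])           # numerically safe exp
            corr = np.exp(m - m_new)                 # rescale prior state
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ v
            m = m_new
        O[qs:qs + block_size] = acc / l[:, None]     # normalize at the end
    return O
```

The key point is that `corr` lets previously accumulated exponentials be rescaled when a later tile contains a larger score, so the final output matches a global softmax exactly.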
Why it matters
FlashAttention is the reason long-context training and inference are affordable. Every modern LLM training stack uses it by default (often via FlashAttention-2 or FlashAttention-3, which add further optimizations for newer GPUs and for causal masking). If you're training or serving a transformer without some variant of FlashAttention, you're leaving a large performance multiplier on the table.
Read next
- FlashAttention-2 (Dao, 2023) — better work partitioning and parallelism.
- FlashAttention-3 (Shah et al., 2024) — Hopper-specific warp specialization and FP8 support.
- PagedAttention / vLLM (Kwon et al., 2023) — the memory-management analogue for the KV cache in serving.