FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Dao, Fu, Ermon, Rudra, Ré
What it says
Standard attention materializes the full N×N score matrix in GPU HBM, applies softmax, then multiplies by V. On real hardware the bottleneck is that HBM traffic, not the FLOPs. FlashAttention tiles the computation so each block of queries and keys is loaded into fast on-chip SRAM once, with softmax computed online via a running max and running normalizer, so the full score matrix is never written to HBM. The result is exact attention, mathematically identical to the standard computation rather than an approximation (though floating-point rounding can differ slightly due to the changed summation order), with 2–4x higher throughput and memory that scales linearly in sequence length instead of quadratically.
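The online-softmax trick is easy to state in code. Here is a minimal NumPy sketch of the tiled recurrence (single head, no masking, illustrative function and block size of my own choosing): each query tile keeps a running row max `m`, a running softmax denominator `l`, and an unnormalized output accumulator, and rescales all three whenever a new key tile raises the max. The real kernel fuses these loops on the GPU and keeps each tile in SRAM; this only shows the math.

```python
import numpy as np

def tiled_attention_sketch(Q, K, V, block_size=64):
    """Tiled exact attention with online softmax (single head, no mask).

    Educational sketch of the FlashAttention recurrence in NumPy; the
    actual kernel runs these loops fused on-chip, never materializing
    the N x N score matrix in HBM.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    for qs in range(0, N, block_size):
        q = Q[qs:qs + block_size]                    # query tile
        m = np.full(q.shape[0], -np.inf)             # running row max
        l = np.zeros(q.shape[0])                     # running softmax denom
        acc = np.zeros((q.shape[0], d))              # unnormalized output
        for ks in range(0, N, block_size):
            k = K[ks:ks + block_size]
            v = V[ks:ks + block_size]
            s = (q @ k.T) * scale                    # score tile
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])           # numerically safe exp
            corr = np.exp(m - m_new)                 # rescale prior state
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ v
            m = m_new
        O[qs:qs + block_size] = acc / l[:, None]     # normalize at the end
    return O
```

The key point is that `corr` lets previously accumulated exponentials be rescaled when a later tile contains a larger score, so the final output matches a global softmax exactly.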
Why it matters
FlashAttention is the reason long-context training and inference are affordable. Every modern LLM training stack uses it by default (often via FlashAttention-2 or FlashAttention-3, which add further optimizations for newer GPUs and for causal masking). If you're training or serving a transformer without some variant of FlashAttention, you're leaving a large performance multiplier on the table.
Read next
- FlashAttention-2 (Dao, 2023) — better work partitioning and parallelism.
- FlashAttention-3 (Shah et al., 2024) — Hopper-specific warp specialization and FP8 support.
- PagedAttention / vLLM (Kwon et al., 2023) — the memory-management analogue for the KV cache in serving.