High-Resolution Image Synthesis with Latent Diffusion Models
Rombach, Blattmann, Lorenz, Esser, Ommer
What it says
Pixel-space diffusion is expensive because every denoising step operates on the full image resolution. The authors train a VAE that encodes an image into a small latent (e.g. 64x64x4 for a 512x512 input) and run the diffusion model there, conditioning on text via cross-attention from a frozen text encoder. A decoder upsamples the denoised latent back to pixels at the end. The result — Stable Diffusion — generates high-quality 512x512 images on a single consumer GPU.
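The core argument is just a shape argument: each denoising step touches far fewer values in latent space, and text conditioning is ordinary cross-attention from latent queries to text-token keys/values. A toy sketch with random weights, assuming the 77×768 token embeddings of the CLIP text encoder SD 1.x uses (the dimensions here are illustrative, not the real model's layer sizes):

```python
import numpy as np

# Shapes from the paper's 512x512 example: the VAE compresses
# 512x512x3 pixels into a 64x64x4 latent (8x spatial downsampling).
pixel = np.zeros((512, 512, 3))
latent = np.zeros((64, 64, 4))
print(pixel.size / latent.size)  # 48.0 -> each step touches 48x fewer values

# Cross-attention conditioning, sketched with random projections:
# queries come from the flattened latent, keys/values from the
# frozen text encoder's token embeddings (77 tokens x 768 dims for CLIP).
d = 64                                             # toy attention dimension
tokens = np.random.randn(77, 768)                  # frozen text-encoder output
q = latent.reshape(-1, 4) @ np.random.randn(4, d)  # (4096, d) latent queries
k = tokens @ np.random.randn(768, d)               # (77, d)
v = tokens @ np.random.randn(768, d)               # (77, d)
scores = q @ k.T / np.sqrt(d)                      # (4096, 77)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over text tokens
out = weights @ v                                  # (4096, d) text-informed features
print(out.shape)                                   # (4096, 64)
```

In the real U-Net this cross-attention appears at several resolutions inside the denoising network; the point of the sketch is only that every latent position attends over all text tokens, which is what lets the prompt steer generation.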
Why it matters
This paper is the reason text-to-image went from “expensive research demo” to “anyone can run it”. The model weights were released openly, which spawned an enormous ecosystem: LoRA fine-tunes, ControlNet, SDXL, image editing workflows, ComfyUI. Latent diffusion is also the foundation of most modern video generation models.
Read next
- DDPM (Ho et al., 2020) — the diffusion paper this builds on.
- SDXL (Podell et al., 2023) — the bigger and better follow-up.
- ControlNet (Zhang et al., 2023) — conditioning on edges, depth, poses.