CVPR 2022

High-Resolution Image Synthesis with Latent Diffusion Models

Rombach, Blattmann, Lorenz, Esser, Ommer

TL;DR
Run the diffusion denoising process in the compressed latent space of a pretrained autoencoder instead of in pixel space. This cuts compute by an order of magnitude and makes text-to-image generation feasible on consumer hardware.

What it says

Pixel-space diffusion is expensive because every denoising step operates on the full image resolution. The authors train a VAE that encodes an image into a small latent (e.g. 64x64x4 for a 512x512 input) and run the diffusion model there, conditioning on text via cross-attention from a frozen text encoder. A decoder upsamples the denoised latent back to pixels at the end. The result — Stable Diffusion — generates high-quality 512x512 images on a single consumer GPU.
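The pipeline above can be sketched in a few lines. This is a toy illustration only, with stand-in functions for the U-Net and VAE decoder (the names `denoise_step`, `decode`, the 50-step schedule, and the simplified update rule are all hypothetical, not the paper's actual architecture or sampler); what it shows is the shape of the computation: the loop runs on a 4x64x64 latent, and only the final decode touches 512x512 pixels.

```python
import numpy as np

LATENT_SHAPE = (4, 64, 64)    # channels, height, width (latent space)
IMAGE_SHAPE = (3, 512, 512)   # pixel space

def denoise_step(z, t, text_emb):
    """Stand-in for the U-Net: predicts noise from latent z, timestep t,
    and a text embedding (the paper conditions via cross-attention)."""
    rng = np.random.default_rng(t)
    return 0.1 * z + 0.01 * rng.standard_normal(z.shape)

def decode(z):
    """Stand-in for the VAE decoder: map latent back to pixel space
    (here a naive 8x nearest-neighbour upsample, keeping 3 channels)."""
    up = z.repeat(8, axis=1).repeat(8, axis=2)
    return up[:3]

text_emb = np.zeros(768)  # stand-in for a frozen text encoder's output
z = np.random.default_rng(0).standard_normal(LATENT_SHAPE)  # start from noise

for t in range(50, 0, -1):       # reverse diffusion, entirely in latent space
    eps = denoise_step(z, t, text_emb)
    z = z - eps / t              # simplified update; real DDPM uses noise-schedule terms

img = decode(z)                  # single decode at the end
assert img.shape == IMAGE_SHAPE

# Elements touched per denoising step: pixel space vs latent space
ratio = np.prod(IMAGE_SHAPE) / np.prod(np.array(LATENT_SHAPE))
print(int(ratio))                # 48
```

The last line makes the compute argument concrete: with these shapes, every denoising step in latent space processes 48x fewer values than a pixel-space step would.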

Why it matters

This paper is the reason text-to-image went from “expensive research demo” to “anyone can run it”. The model weights were released openly, which spawned an enormous ecosystem: LoRA fine-tunes, ControlNet, SDXL, image editing workflows, ComfyUI. Latent diffusion is also the foundation of most modern video generation models.

  • DDPM (Ho et al., 2020) — the diffusion paper this builds on.
  • SDXL (Podell et al., 2023) — the bigger and better follow-up.
  • ControlNet (Zhang et al., 2023) — conditioning on edges, depth, poses.