EMNLP 2020

Dense Passage Retrieval for Open-Domain Question Answering

Karpukhin, Oguz, Min, Lewis, Wu, Edunov, Chen, Yih

TL;DR
Train a BERT-based dual encoder with in-batch negatives to embed queries and passages into the same space. Beats BM25 for open-domain QA retrieval.

What it says

DPR uses two BERT encoders — one for questions, one for passages — trained so that matching question/passage pairs have high dot product and non-matching pairs have low dot product. Training uses in-batch negatives (other passages in the same mini-batch act as negatives for free) plus hard negatives mined from BM25. At inference, all passages are embedded offline; a new question is embedded at query time and nearest neighbors are retrieved via FAISS.

Why it matters

DPR is the canonical recipe for dense retrieval and a direct ancestor of most modern embedding models used in RAG. The dual-encoder, in-batch-negatives pattern is still the default way to train retrieval models in 2026. Before DPR, BM25 was the hard-to-beat baseline; after DPR, dense retrieval became the default.

  • ColBERT (Khattab & Zaharia, 2020) — late-interaction retrieval, better quality at higher cost.
  • Sentence-BERT (Reimers & Gurevych, 2019) — the earlier dual-encoder that set the template.
  • E5 / BGE embedding models (2023–2024) — modern production-grade descendants.