Retrieval-Augmented Generation

RAG

In one line

A pattern where you retrieve relevant documents at query time and stuff them into the LLM’s prompt so it can ground its answer in real sources.

What it actually means

A RAG pipeline has two halves. Offline: you take your corpus, chunk it, embed each chunk, and store the vectors in a vector database alongside the original text. Online: a user query comes in, you embed it, run a nearest-neighbour search to pull the top-k chunks, optionally rerank them, then build a prompt that says “answer the question using only this context” and call the LLM. The model never gets to see your full knowledge base — only what retrieval surfaced for this query.
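The two halves above can be sketched end to end. This is a toy, assuming a bag-of-words "embedding" and an in-memory list standing in for the vector database; a real pipeline would call an embedding model and a proper ANN index, and the corpus strings here are made up.

```python
import math
from collections import Counter

# Toy "embedding": a bag-of-words vector. A real pipeline would call an
# embedding model and store the vectors in a vector database.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline half: chunk the corpus, embed each chunk, index vector + text.
corpus = [
    "Refunds are processed within 14 days of the return request.",
    "Our API rate limit is 100 requests per minute per key.",
    "Support is available Monday to Friday, 9am to 5pm UTC.",
]
index = [(embed(chunk), chunk) for chunk in corpus]

# Online half: embed the query, nearest-neighbour search, keep top-k.
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

chunks = retrieve("how long do refunds take?")
prompt = (
    "Answer the question using only this context.\n"
    + "\n".join(chunks)
    + "\n\nQ: how long do refunds take?"
)
```

Note that only the top-k chunks ever make it into the prompt, which is the whole point: the model sees what retrieval surfaced, not the corpus.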

Why it matters

RAG is how you use an LLM on data it wasn’t trained on without fine-tuning. It’s cheaper, easier to update (just re-index), and you can show citations. The catch is that retrieval quality dominates everything: a great LLM with bad retrieval will confidently answer with the wrong context. Most “our RAG isn’t working” problems are really chunking, embedding, or reranker problems.

Example

1. user query   → embed → top-k chunks via vector search
2. (optional)   → rerank top-k with a cross-encoder, keep top-n
3. prompt       → "Answer using only the context below.\n{chunks}\n\nQ: {query}"
4. LLM call     → answer + citations to chunk IDs
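Steps 2-4 can be sketched as prompt assembly. The rerank here is a stub (a real one would score each query-chunk pair with a cross-encoder), the chunk IDs and texts are invented for illustration, and the LLM call itself is provider-specific and omitted.

```python
# Step 2 (stubbed): a real reranker scores each (query, chunk) pair with
# a cross-encoder; here we just keep the first n in retrieval order.
def rerank(query, chunks, n=2):
    return chunks[:n]

# Step 3: build the grounded prompt, labelling each chunk with an ID so
# the model can cite its sources in the answer.
def build_prompt(query, chunks):
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks)
    return (
        "Answer using only the context below. "
        "Cite chunk IDs like [c1] after each claim.\n\n"
        f"{context}\n\nQ: {query}"
    )

top_k = [
    ("c1", "Refunds are processed within 14 days."),
    ("c2", "The API rate limit is 100 requests per minute."),
    ("c3", "Support hours are 9am to 5pm UTC."),
]
query = "How long do refunds take?"
prompt = build_prompt(query, rerank(query, top_k))
# Step 4 would send this prompt to the LLM and parse answer + citations.
```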

You’ll hear it when

  • Building a chatbot over internal docs, policy, or a product knowledge base.
  • Debating fine-tuning vs RAG for adding domain knowledge.
  • Picking a vector database and an embedding model.
  • Evaluating retrieval quality (recall@k, MRR, nDCG).
  • Designing citations and source attribution for a generated answer.
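Two of the retrieval metrics mentioned above, recall@k and MRR, are simple to compute given a labelled eval set. A minimal sketch, assuming each example pairs a query's ranked chunk IDs with the set of IDs known to be relevant (the IDs below are made up):

```python
# recall@k: what fraction of the relevant chunks appear in the top k?
def recall_at_k(ranked, relevant, k):
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

# MRR: mean reciprocal rank of the first relevant chunk per query.
def mrr(results):
    total = 0.0
    for ranked, relevant in results:
        for rank, cid in enumerate(ranked, start=1):
            if cid in relevant:
                total += 1.0 / rank
                break
    return total / len(results)

evals = [
    (["c3", "c1", "c7"], {"c1"}),        # first relevant at rank 2
    (["c2", "c9", "c4"], {"c4", "c8"}),  # first relevant at rank 3
]
print(recall_at_k(["c3", "c1", "c7"], {"c1"}, k=2))  # 1.0
print(mrr(evals))  # (1/2 + 1/3) / 2 ≈ 0.4167
```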
