Retrieval-Augmented Generation
RAG
In one line
A pattern where you retrieve relevant documents at query time and stuff them into the LLM’s prompt so it can ground its answer in real sources.
What it actually means
A RAG pipeline has two halves. Offline: you take your corpus, chunk it, embed each chunk, and store the vectors in a vector database alongside the original text. Online: a user query comes in, you embed it, run a nearest-neighbour search to pull the top-k chunks, optionally rerank them, then build a prompt that says “answer the question using only this context” and call the LLM. The model never gets to see your full knowledge base — only what retrieval surfaced for this query.
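Both halves can be sketched in a few lines. This is a toy, assuming a bag-of-words counter as a stand-in for a real embedding model and an in-memory list as the vector store; the corpus, chunk texts, and the `embed`/`retrieve` helpers are all illustrative, not a real library's API.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector. A real pipeline would
    call an embedding model here (this is only a stand-in)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline half: chunk the corpus, embed each chunk, store vector + text.
corpus = [
    "Refunds are processed within 5 business days.",
    "Our office is open Monday to Friday.",
    "Contact support via the help portal.",
]
index = [(embed(chunk), chunk) for chunk in corpus]

def retrieve(query, k=2):
    """Online half: embed the query, nearest-neighbour search, top-k chunks."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in scored[:k]]

top = retrieve("how long do refunds take?")
```

The shape is the important part: the index is built once offline, and each query only touches the top-k chunks it surfaces.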
Why it matters
RAG is how you use an LLM on data it wasn’t trained on without fine-tuning. It’s cheaper, easier to update (just re-index), and you can show citations. The catch is that retrieval quality dominates everything: a great LLM with bad retrieval will confidently answer with the wrong context. Most “our RAG isn’t working” problems are really chunking, embedding, or reranker problems.
Example
1. user query → embed → top-k chunks via vector search
2. (optional) → rerank top-k with a cross-encoder, keep top-n
3. prompt → "Answer using only the context below.\n{chunks}\n\nQ: {query}"
4. LLM call → answer + citations to chunk IDs
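Step 3 of the flow above, prompt assembly, might look like this. The template and the bracketed chunk IDs are illustrative choices, assuming you want the model to cite sources by ID; `build_prompt` is a hypothetical helper, not a library function.

```python
def build_prompt(query, chunks):
    """Assemble the grounding prompt, tagging each retrieved chunk with an
    ID so the answer can cite [1], [2], ... back to its sources."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, 1))
    return f"Answer using only the context below.\n{context}\n\nQ: {query}"

chunks = [
    "Refunds are processed within 5 business days.",
    "Contact support via the help portal.",
]
prompt = build_prompt("How long do refunds take?", chunks)
```

The resulting string is what actually goes to the LLM call in step 4, context first, question last.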
You’ll hear it when
- Building a chatbot over internal docs, policy, or a product knowledge base.
- Debating fine-tuning vs RAG for adding domain knowledge.
- Picking a vector database and an embedding model.
- Evaluating retrieval quality (recall@k, MRR, nDCG).
- Designing citations and source attribution for a generated answer.
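The retrieval metrics mentioned above are simple to compute once you have, per query, the ranked chunk IDs your retriever returned and the set of IDs a human marked relevant. A minimal sketch of recall@k and MRR, with a made-up two-query eval set:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(eval_set):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in eval_set:
        for rank, doc_id in enumerate(retrieved, 1):
            if doc_id in relevant:
                total += 1 / rank
                break
    return total / len(eval_set)

# Hypothetical eval set: (ranked IDs returned, set of relevant IDs).
eval_set = [
    (["c3", "c1", "c7"], {"c1"}),  # first relevant hit at rank 2
    (["c2", "c5", "c9"], {"c2"}),  # first relevant hit at rank 1
]
score = mrr(eval_set)
```

Tracking these numbers on a fixed eval set is how you tell whether a "RAG isn't working" problem lives in retrieval or in the LLM.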