Retrieval-Augmented Generation (RAG)
Give an LLM fresh, private, or domain-specific knowledge at query time by retrieving relevant chunks and stuffing them into the prompt.
What RAG solves
LLMs have three recurring failure modes in production:
- Knowledge cutoff. The model doesn’t know about anything that happened after its training data was collected.
- Private or proprietary knowledge. It has never seen your company’s docs, tickets, or runbooks.
- Hallucination. When it doesn’t know, it confidently makes things up.
Fine-tuning addresses some of this but is expensive, slow to iterate on, and bad at “just learn this new fact”. RAG (Retrieval-Augmented Generation) is the alternative: at query time, look up the relevant documents, paste them into the prompt, and let the LLM answer with grounded context. You trade context-window tokens for fresh, attributable knowledge.
The basic pipeline
ingest:  docs ──▶ chunk ──▶ embed ──▶ vector store
                                           │
                                           ▼
query ──▶ embed ─────────────────────▶ retrieve (top-k) ──▶ prompt ──▶ LLM ──▶ answer + sources
Five steps you have to get right:
- Chunk the documents into passages small enough to fit in the prompt but large enough to be self-contained.
- Embed each chunk into a vector with an embedding model.
- Store the (chunk, embedding, metadata) tuples in a vector database.
- Retrieve the top-k chunks most similar to the query embedding.
- Generate an answer with the LLM, feeding it the retrieved chunks as context and asking it to cite sources.
The first three happen at ingest time (offline). The last two happen at query time (latency-sensitive).
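A minimal end-to-end sketch of the five steps. The hashed bag-of-words vector is a toy stand-in for a real embedding model, and a plain list stands in for the vector store; all names and documents here are illustrative:

```python
import math
import re
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: hashed bag-of-words,
    # L2-normalized so a plain dot product is cosine similarity.
    v = [0.0] * dim
    for tok, n in Counter(re.findall(r"[a-z0-9]+", text.lower())).items():
        v[hash(tok) % dim] += n
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Ingest time (offline): chunk, embed, store. Here each doc is one chunk.
docs = {
    "runbook.md": "Restart the payments service with systemctl restart payments.",
    "faq.md": "Refunds are processed within 5 business days.",
}
store = [(doc_id, text, embed(text)) for doc_id, text in docs.items()]

# Query time (latency-sensitive): embed the query, retrieve top-k, build the prompt.
def answer_context(query: str, k: int = 1) -> str:
    q = embed(query)
    hits = sorted(store, key=lambda rec: cosine(q, rec[2]), reverse=True)[:k]
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text, _ in hits)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

print(answer_context("How do I restart the payments service?"))
```

Swap `embed` for a real model and `store` for a real vector DB and the shape of the code stays the same.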
Chunking strategies
The chunk size is the single most impactful knob.
- Fixed-size with overlap. Split every 500–1000 tokens with ~10% overlap. Simple, robust, and the right default.
- Recursive / structural. Split on headings, paragraphs, then sentences. Preserves semantic boundaries. Works well for docs with clear structure.
- Sentence-aware. Packs sentences into chunks up to a target size. A middle ground.
- Parent/child. Embed small chunks for precise retrieval; return the larger parent passage to the LLM for context.
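The fixed-size-with-overlap default can be sketched as a sliding window over a pre-tokenized list (a real pipeline would tokenize with the embedding model’s own tokenizer):

```python
def chunk_fixed(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    # Slide a window of `size` tokens, stepping by size - overlap,
    # so each chunk repeats the tail of the previous one.
    assert 0 <= overlap < size
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # final window reached the end
            break
    return chunks

tokens = [f"tok{i}" for i in range(1200)]
print([len(c) for c in chunk_fixed(tokens, size=500, overlap=50)])  # [500, 500, 300]
```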
Common mistakes:
- Chunks too small → retrieval surfaces relevant snippets, but each one lacks the surrounding context the LLM needs to use them.
- Chunks too large → each chunk wastes embedding space on unrelated content; similarity scores get noisy.
- No metadata. Always store source URL, title, section heading, and timestamps. You’ll need them for filtering, citation, and debugging.
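A sketch of what a chunk record with that metadata might look like, plus a metadata filter applied before similarity search (field names and URLs are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    text: str
    embedding: list[float]
    # Metadata you will need later for filtering, citation, and debugging.
    source_url: str
    title: str
    section: str
    updated_at: str  # ISO 8601 timestamp

records = [
    ChunkRecord("Refunds take 5 days.", [0.1, 0.9],
                "https://docs.example.com/faq", "FAQ", "Refunds", "2024-06-01"),
    ChunkRecord("Old refund policy.", [0.2, 0.8],
                "https://docs.example.com/faq-v1", "FAQ v1", "Refunds", "2021-01-01"),
]

# Metadata filter before similarity search: only consider recent docs.
# (ISO timestamps compare correctly as strings.)
fresh = [r for r in records if r.updated_at >= "2023-01-01"]
print([r.title for r in fresh])  # ['FAQ']
```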
Hybrid search: BM25 + dense
Dense embeddings are great at meaning but miss exact-match signals (product codes, error IDs, rare names). BM25 (classic keyword search) catches exactly those cases. Running both and combining the scores gives you a meaningful quality lift on almost every real corpus.
A common recipe:
- Dense retrieve top 50.
- BM25 retrieve top 50.
- Combine with Reciprocal Rank Fusion (RRF): score = Σ 1/(k + rank_i).
- Pass the top 20 to a reranker.
Most vector DBs (OpenSearch, Weaviate, Atlas Vector Search, Vespa) support both in a single query.
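The fusion step can be sketched in a few lines; the `rrf` name and doc IDs are illustrative, and k = 60 is the conventional default:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank_d),
    # where rank_d is d's 1-based position in that list (missing docs add 0).
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]  # top of the dense retriever's list
bm25 = ["d3", "d2", "d5"]   # top of the BM25 list
print(rrf([dense, bm25])[0])  # d3 wins: it ranks first in both lists
```

Because RRF only looks at ranks, not raw scores, it sidesteps the problem that dense and BM25 scores live on incompatible scales.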
Reranking
The initial retriever is optimized for recall — cast a wide net. A cross-encoder reranker takes the top 20–50 candidates and rescores them by running a smaller model on each (query, chunk) pair. Cross-encoders see both texts together and are much more accurate than bi-encoders (which only see them separately).
Good defaults: bge-reranker-v2-m3, Cohere Rerank, or a domain-tuned cross-encoder. Rerank latency is real (50–300 ms typical); budget for it.
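The rerank stage itself is just "score every pair, keep the best". A sketch with a pluggable scorer; `toy_score` is a word-overlap stand-in for a real cross-encoder call (e.g. sentence-transformers’ `CrossEncoder(...).predict`):

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    # Score every (query, chunk) pair with the cross-encoder, keep the best top_n.
    return sorted(chunks, key=lambda c: score_pair(query, c), reverse=True)[:top_n]

# Toy scorer standing in for a real cross-encoder.
def toy_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

candidates = [
    "billing cycle starts monthly",
    "reset your password in settings",
    "password reset email link",
]
print(rerank("how to reset my password", candidates, toy_score, top_n=2))
```

Note the cost model: one scorer call per candidate, which is why you rerank 20–50 chunks, not the whole corpus.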
Evaluation with Ragas
“Does it feel better?” is not an evaluation strategy. A RAG system has at least three dials — what you retrieve, what you keep, and what the model generates — and you need to measure each one.
Ragas (and friends — TruLens, DeepEval) define metrics like:
- Context precision — are the retrieved chunks actually relevant to the question?
- Context recall — did we retrieve everything needed to answer?
- Faithfulness — does the generated answer only say things supported by the retrieved chunks?
- Answer relevance — does the answer address the question?
Build a small gold set of (question, ideal answer, ideal sources) — 50 examples is enough to start. Run it on every meaningful change.
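Once you have gold sources per question, the retrieval-side metrics reduce to simple set arithmetic. A minimal hand-rolled sketch (Ragas computes LLM-judged versions of these, so treat this only as the intuition):

```python
def context_precision(retrieved: list[str], gold: set[str]) -> float:
    # Fraction of retrieved chunks that are actually relevant.
    if not retrieved:
        return 0.0
    return sum(1 for r in retrieved if r in gold) / len(retrieved)

def context_recall(retrieved: list[str], gold: set[str]) -> float:
    # Fraction of the gold sources we managed to retrieve.
    if not gold:
        return 1.0
    return sum(1 for g in gold if g in retrieved) / len(gold)

# One gold-set example: question, ideal answer, ideal sources.
example = {
    "question": "How long do refunds take?",
    "ideal_answer": "Refunds are processed within 5 business days.",
    "sources": {"faq.md#refunds"},
}
retrieved = ["faq.md#refunds", "pricing.md#tiers"]
print(context_precision(retrieved, example["sources"]),  # 0.5
      context_recall(retrieved, example["sources"]))     # 1.0
```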
Common failure modes
- Retrieval returns the wrong documents (bad chunking, bad embedding model, no metadata filtering). Fix retrieval first — the generator can’t fix what it was never shown.
- Retrieval returns the right documents, but the LLM ignores them. The prompt is unclear, the context is too long, or the LLM trusts its prior over the context. Fix the prompt, shrink the context, or use a stronger model.
- The LLM cites things that aren’t in the context. Add an instruction to cite only from the provided sources; post-process to verify citations actually exist.
- Lost-in-the-middle. LLMs attend more to the start and end of long contexts. Keep the prompt short and place the most important chunks at the edges.
- Stale index. The docs have updated, the vector store has not. Set up an ingestion pipeline that re-embeds on source changes.
- Multi-hop questions. “Which customer’s most recent support ticket mentioned our competitor?” can’t be answered with a single retrieval. You need query decomposition, iterative retrieval, or an agent.
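The lost-in-the-middle mitigation above can be sketched as an interleaving step: take chunks in ranked order and alternate them front/back so the strongest sit at the edges and the weakest end up in the middle (the function name is illustrative):

```python
def reorder_for_edges(chunks_best_first: list[str]) -> list[str]:
    # Alternate ranked chunks between the front and the back of the prompt,
    # so rank 1 is first, rank 2 is last, and the weakest land in the middle.
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_edges(["c1", "c2", "c3", "c4", "c5"]))
# ['c1', 'c3', 'c5', 'c4', 'c2']
```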
When not to RAG
- The corpus fits in the context window. Just paste it.
- The knowledge is truly static and needs to be reflected in the model’s default behavior — prefer fine-tuning.
- You need exact arithmetic or structured data queries. Use tools / SQL, not text retrieval.