Ragas
An evaluation framework for retrieval-augmented generation systems — faithfulness, answer relevance, context precision, and more.
Category
MLOps
Difficulty
Intermediate
When to use
You have a RAG pipeline and need quantitative metrics beyond eyeballing outputs, especially for regression testing and comparison.
When not to use
You have no eval set at all yet — build one first, then reach for Ragas.
Alternatives
TruLens, DeepEval, LangSmith evals, custom LLM-as-judge
At a glance
| Field | Value |
|---|---|
| Category | RAG evaluation framework |
| Difficulty | Intermediate |
| When to use | Measuring RAG quality over a fixed dataset |
| When not to use | You have no labeled or golden examples yet |
| Alternatives | TruLens, DeepEval, LangSmith evals |
What it is
Ragas provides a set of RAG-specific metrics — faithfulness (does the answer stay grounded in retrieved context), answer relevance, context precision, context recall — that use an LLM as a judge under the hood. You feed in a dataset of (question, answer, contexts, ground_truth) rows and Ragas scores each row on each metric.
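A single evaluation row might look like the following sketch (the values are illustrative, not from a real dataset); note that `contexts` is a list of retrieved chunks, so across a whole dataset the column has type `list[list[str]]`:

```python
# One evaluation row in the shape Ragas expects (illustrative values).
row = {
    "question": "What is the refund window?",
    "answer": "Refunds are accepted within 30 days of purchase.",
    # All retrieved chunks for this question, as a list of strings.
    "contexts": ["Our policy allows refunds within 30 days of purchase."],
    "ground_truth": "Customers may request a refund up to 30 days after purchase.",
}
```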
When we reach for it at Ephizen
- Before and after any change to chunking, embedding model, or reranker.
- A/B comparing retrieval strategies on the same golden questions.
- Catching regressions when we swap LLM providers or models.
- Generating a scorecard that non-ML stakeholders can actually read.
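For the before/after and provider-swap cases, the scores feed a simple regression gate. A minimal sketch (the helper name, scores, and tolerance are ours, not part of Ragas):

```python
def regression_gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list[str]:
    """Return metric names where the candidate dropped more than `tolerance`
    below the baseline. An empty list means the change passes."""
    return [m for m, b in baseline.items() if candidate.get(m, 0.0) < b - tolerance]

# Hypothetical scores from two eval runs on the same golden questions.
baseline = {"faithfulness": 0.91, "answer_relevancy": 0.88, "context_precision": 0.84}
candidate = {"faithfulness": 0.90, "answer_relevancy": 0.81, "context_precision": 0.85}

regression_gate(baseline, candidate)  # → ["answer_relevancy"]
```

The tolerance matters: LLM-judged scores jitter between runs, so gating on any decrease at all will fail builds on noise.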
Getting started
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

ds = Dataset.from_dict({
    "question": [...],
    "answer": [...],
    "contexts": [...],  # list[list[str]]: retrieved chunks per question
    "ground_truth": [...],
})

result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```
Gotchas
- Ragas makes an LLM call per metric per row, so a 1,000-row eval with three metrics means thousands of judge calls; use a cheap judge model.
- LLM-as-judge is noisy. Run each eval at least twice and report the mean rather than a single run's number.
- Metrics are heuristics, not ground truth. Always spot-check outliers by hand.
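Averaging repeated runs is a one-liner once you have the per-metric score dicts. A small sketch (the helper and the scores are illustrative, not part of Ragas):

```python
from statistics import mean

def mean_scores(runs: list[dict]) -> dict:
    """Average per-metric scores across repeated eval runs to damp judge noise."""
    return {m: mean(r[m] for r in runs) for m in runs[0]}

# Hypothetical scores from two runs of the same eval.
run1 = {"faithfulness": 0.90, "answer_relevancy": 0.84}
run2 = {"faithfulness": 0.86, "answer_relevancy": 0.88}

mean_scores([run1, run2])  # ≈ {"faithfulness": 0.88, "answer_relevancy": 0.86}
```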