Evaluating RAG pipelines: retrieval first, generation second

Retrieval-augmented generation fails in two different ways: bad retrieval and bad generation. Measure them separately.

Retrieval metrics

Track whether the gold passage appears in the top-k results for held-out questions (recall@k). Without decent recall, downstream metrics are misleading: the generator cannot ground an answer in a passage it never received.
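
As a rough illustration, the top-k check can be computed in a few lines. This is a minimal sketch, assuming each held-out question has exactly one gold passage id and that the retriever returns a ranked list of ids. The function name `recall_at_k` and the sample ids are hypothetical, not taken from any particular library.

```python
from typing import Dict, List


def recall_at_k(retrieved: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of questions whose gold passage id is among the top-k retrieved ids."""
    if not gold:
        raise ValueError("need at least one held-out question")
    hits = sum(
        1
        for question_id, gold_id in gold.items()
        # A question with no retrieval results counts as a miss.
        if gold_id in retrieved.get(question_id, [])[:k]
    )
    return hits / len(gold)


# Hypothetical held-out set: question id -> gold passage id.
gold = {"q1": "p7", "q2": "p3", "q3": "p9"}
# Hypothetical ranked retriever output: question id -> passage ids, best first.
retrieved = {
    "q1": ["p7", "p2", "p5"],
    "q2": ["p1", "p3", "p4"],
    "q3": ["p2", "p6", "p8"],
}

for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(retrieved, gold, k):.2f}")
```

Reporting the number at several values of k shows whether misses come from ranking (the passage is found but ranked low) or from the passage not being retrieved at all.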

Grounding checks

Is each specific claim in the answer supported by the retrieved passages? A simple entailment check or human spot checks catch hallucinated specifics early.
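
A true entailment check needs a trained model, so the sketch below swaps in a much cruder lexical stand-in: it flags numbers and capitalized words in the answer that never appear in the retrieved passages. It is meant only as a cheap way to pick candidates for human spot checks, and the function name `unsupported_specifics` and the example text are assumptions for illustration.

```python
import re
from typing import List

# Numbers (optionally with decimals) or alphabetic words.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+")


def unsupported_specifics(answer: str, passages: List[str]) -> List[str]:
    """Return numbers and capitalized words from the answer that are absent from the passages.

    This is a lexical heuristic, not entailment: it misses paraphrased
    errors and can flag correct but reworded content.
    """
    context_tokens = {t.lower() for t in TOKEN.findall(" ".join(passages))}
    flagged: List[str] = []
    for token in TOKEN.findall(answer):
        is_specific = token[0].isdigit() or token[0].isupper()
        if is_specific and token.lower() not in context_tokens and token not in flagged:
            flagged.append(token)
    return flagged


# Hypothetical retrieved passage and generated answer.
passages = ["The Eiffel Tower was completed in 1889 and is 330 metres tall."]
answer = "The Eiffel Tower was completed in 1887 by Gustave Eiffel."

print(unsupported_specifics(answer, passages))  # ['1887', 'Gustave']
```

Answers with a non-empty result are good candidates to route to a reviewer first. An empty result does not prove the answer is faithful.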

User-facing quality

Fluency and tone matter after faithfulness. Use structured rubrics so scores are comparable week over week.
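
One way to keep scores comparable is to fix the criteria and scale in code and reject any rating that does not match. The sketch below assumes a 1 to 5 scale and two criteria named after the qualities mentioned above. The `RUBRIC` layout and `score_week` helper are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical rubric: criterion name -> (lowest score, highest score).
RUBRIC = {"fluency": (1, 5), "tone": (1, 5)}


def score_week(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Validate every rating against the rubric, then average each criterion."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    for rating in ratings:
        if set(rating) != set(RUBRIC):
            raise ValueError(f"rating must score exactly {sorted(RUBRIC)}")
        for name, value in rating.items():
            low, high = RUBRIC[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}..{high}")
    return {name: mean(r[name] for r in ratings) for name in RUBRIC}


# Hypothetical ratings from two reviewers for the same week.
week = [
    {"fluency": 4, "tone": 5},
    {"fluency": 5, "tone": 3},
]
print(score_week(week))
```

Because the rubric is fixed, a change in the weekly averages reflects a change in the answers and not a change in what reviewers were asked to score.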
