Retrieval-augmented generation fails in two different ways: bad retrieval and bad generation. Measure them separately.
Retrieval metrics
For a held-out set of questions, track how often the gold passage appears in the top-k results (recall@k). Without decent recall, downstream metrics are misleading: the generator cannot answer faithfully from passages it never received.
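One way to make this concrete is a minimal sketch of the top-k hit rate (often called recall@k when each question has a single gold passage). The passage ids, the `runs` structure, and the function names are assumptions for illustration, not part of any particular retrieval library.

```python
from typing import Sequence, Tuple


def hit_at_k(ranked_ids: Sequence[str], gold_id: str, k: int) -> bool:
    """Return True if the gold passage id is among the first k retrieved ids."""
    return gold_id in ranked_ids[:k]


def recall_at_k(runs: Sequence[Tuple[Sequence[str], str]], k: int) -> float:
    """Fraction of questions whose gold passage appears in the top k results."""
    if not runs:
        raise ValueError("no questions to evaluate")
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in runs)
    return hits / len(runs)


# Hypothetical held-out set: (ranked passage ids from the retriever, gold id).
runs = [
    (["p7", "p2", "p9"], "p2"),  # gold at rank 2
    (["p1", "p4", "p3"], "p3"),  # gold at rank 3
    (["p5", "p6", "p8"], "p0"),  # gold never retrieved
]

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(runs, k):.2f}")
```

Reporting the number at several values of k shows whether the gold passage is missing entirely or merely ranked too low for the generator to see it.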
Grounding checks
Is each claim in the answer supported by the retrieved passages? Simple entailment checks or human spot checks catch hallucinated specifics early.
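A sentence-level grounding check might be sketched as follows. The `entails` callable is a placeholder for whatever entailment model or human judgment is used; `token_overlap_entails` is only a crude stand-in based on word overlap, not a real entailment model, and every name and threshold here is an assumption for illustration.

```python
import re
from typing import Callable, List, Sequence


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def token_overlap_entails(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Crude stand-in for an entailment model (assumption, not real NLI):
    share of hypothesis words that also occur in the premise."""
    hyp = set(re.findall(r"\w+", hypothesis.lower()))
    prem = set(re.findall(r"\w+", premise.lower()))
    if not hyp:
        return True
    return len(hyp & prem) / len(hyp) >= threshold


def unsupported_sentences(
    answer: str,
    passages: Sequence[str],
    entails: Callable[[str, str], bool] = token_overlap_entails,
) -> List[str]:
    """Return answer sentences that no retrieved passage supports."""
    return [
        s for s in split_sentences(answer)
        if not any(entails(p, s) for p in passages)
    ]


# Hypothetical example data.
passages = ["The service was launched in 2019 and supports three regions."]
answer = "The service was launched in 2019. It supports twelve regions."

print(unsupported_sentences(answer, passages))
# prints ['It supports twelve regions.']
```

Flagged sentences are candidates for human review rather than confirmed hallucinations; swapping in a stronger `entails` function changes the precision of the check but not its shape.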
User-facing quality
Fluency and tone come after faithfulness: a polished answer that is wrong is still wrong. Use structured rubrics with fixed criteria and a fixed scale so scores are comparable week over week.
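A structured rubric could be as simple as a fixed set of criteria with a fixed scale, validated on every rating so that weekly averages stay comparable. The criteria, the 1 to 5 scale, the week labels, and the ratings below are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence

# Hypothetical rubric: criterion name -> (min score, max score).
RUBRIC = {"fluency": (1, 5), "tone": (1, 5)}


@dataclass(frozen=True)
class Rating:
    week: str               # e.g. an ISO week label
    scores: Dict[str, int]  # one score per rubric criterion


def validate(rating: Rating) -> None:
    """Reject ratings that skip a criterion, add one, or leave the scale."""
    if set(rating.scores) != set(RUBRIC):
        raise ValueError(f"criteria mismatch: {sorted(rating.scores)}")
    for name, value in rating.scores.items():
        low, high = RUBRIC[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} outside {low}..{high}")


def weekly_means(ratings: Sequence[Rating]) -> Dict[str, Dict[str, float]]:
    """Average each criterion per week, validating every rating first."""
    by_week: Dict[str, List[Rating]] = {}
    for rating in ratings:
        validate(rating)
        by_week.setdefault(rating.week, []).append(rating)
    return {
        week: {c: mean(r.scores[c] for r in group) for c in RUBRIC}
        for week, group in by_week.items()
    }


# Hypothetical ratings from two reviewers over two weeks.
ratings = [
    Rating("2024-W01", {"fluency": 4, "tone": 3}),
    Rating("2024-W01", {"fluency": 5, "tone": 4}),
    Rating("2024-W02", {"fluency": 3, "tone": 4}),
]

print(weekly_means(ratings))
```

Rejecting off-rubric ratings at ingestion time keeps a change in criteria or scale from silently shifting the trend line.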