
Evaluating RAG pipelines: retrieval first, generation second

Retrieval-augmented generation fails in two different ways: bad retrieval and bad generation. Measure them separately.

Retrieval metrics

Track recall@k: the share of held-out questions for which the gold passage appears in the top-k results. When recall is low, the generator is often answering without the evidence it needs, so downstream metrics are misleading.
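The check above can be sketched in a few lines. This is a minimal sketch, assuming each question has exactly one gold passage ID and that the retriever returns a ranked list of passage IDs per question; the names `recall_at_k`, `retrieved`, and `gold` are illustrative, not from any particular library.

```python
from typing import Dict, List


def recall_at_k(retrieved: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of questions whose gold passage ID appears in the top-k results.

    Assumes one gold passage per question (an assumption for this sketch).
    """
    if not gold:
        raise ValueError("gold must contain at least one question")
    hits = sum(
        1
        for question_id, gold_id in gold.items()
        # A question with no retrieved results counts as a miss.
        if gold_id in retrieved.get(question_id, [])[:k]
    )
    return hits / len(gold)


# Toy held-out set: ranked passage IDs per question, plus the gold ID.
retrieved = {
    "q1": ["p3", "p7", "p1"],
    "q2": ["p2", "p9", "p4"],
    "q3": ["p5", "p6", "p8"],
}
gold = {"q1": "p1", "q2": "p2", "q3": "p0"}

for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(retrieved, gold, k):.2f}")
```

Reporting the number at more than one k helps separate "the passage is never found" from "the passage is found but ranked too low".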

Generation metrics

Start with faithfulness: is every claim in the answer supported by the retrieved passages? Simple entailment checks or human spot checks catch hallucinated specifics early.
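One way to wire up such a check is to flag any answer claim that no retrieved passage supports. The sketch below assumes the answer has already been split into claims and takes the entailment judge as a plug-in function. The `naive_entails` judge here is a deliberately crude substring match, a stand-in so the example runs; it is not real entailment, and in practice you would substitute an NLI model or a human reviewer. All names are hypothetical.

```python
from typing import Callable, List


def unsupported_claims(
    claims: List[str],
    passages: List[str],
    entails: Callable[[str, str], bool],
) -> List[str]:
    """Return the claims that no passage supports, according to `entails`.

    `entails(passage, claim)` is a pluggable judge (assumption: it returns
    True when the passage supports the claim).
    """
    return [
        claim
        for claim in claims
        if not any(entails(passage, claim) for passage in passages)
    ]


def naive_entails(passage: str, claim: str) -> bool:
    # Stand-in judge for illustration only: case-insensitive substring match.
    # Swap in an NLI model or human judgment for real evaluation.
    return claim.lower() in passage.lower()


passages = ["The Eiffel Tower is 330 metres tall and opened in 1889."]
claims = ["opened in 1889", "designed in 1850"]

flagged = unsupported_claims(claims, passages, naive_entails)
print(flagged)
```

Keeping the judge as a parameter means the same harness works for automated checks and for routing flagged claims to human spot checks.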

Fluency and tone matter only once the answer is faithful. Score them with a structured rubric, using the same criteria and the same scale every time, so scores are comparable week over week.
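A structured rubric can be as simple as a fixed set of criteria and a fixed scale, enforced in code so that every week's scores mean the same thing. The sketch below is one possible shape; the criteria names, the 1-to-5 scale, the week labels, and the `RubricScore` and `weekly_means` names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Assumed rubric: fixed criteria and a fixed 1-5 scale.
CRITERIA = ("fluency", "tone")
SCALE = range(1, 6)


@dataclass(frozen=True)
class RubricScore:
    answer_id: str
    week: str  # any consistent label, e.g. "2024-W01"
    scores: Dict[str, int]

    def __post_init__(self) -> None:
        # Reject records that drift from the agreed rubric.
        if set(self.scores) != set(CRITERIA):
            raise ValueError(f"scores must cover exactly {CRITERIA}")
        for name, value in self.scores.items():
            if value not in SCALE:
                raise ValueError(f"{name}={value} is outside the 1-5 scale")


def weekly_means(records: List[RubricScore]) -> Dict[str, Dict[str, float]]:
    """Average each criterion per week so weeks can be compared directly."""
    by_week: Dict[str, List[RubricScore]] = {}
    for record in records:
        by_week.setdefault(record.week, []).append(record)
    return {
        week: {c: mean(r.scores[c] for r in group) for c in CRITERIA}
        for week, group in by_week.items()
    }


records = [
    RubricScore("a1", "2024-W01", {"fluency": 4, "tone": 3}),
    RubricScore("a2", "2024-W01", {"fluency": 5, "tone": 3}),
    RubricScore("a3", "2024-W02", {"fluency": 3, "tone": 4}),
]
summary = weekly_means(records)
print(summary)
```

Validating at construction time keeps an off-rubric score from silently skewing a weekly average.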
