Most RAG systems don’t fail at generation. They fail at retrieval, and the model confidently fills in the gap. The result reads fine, sounds authoritative, and is wrong.

After shipping a few of these and quietly fixing more, here’s what actually keeps a RAG pipeline honest.

1. Retrieval has to fail loudly

If your retriever returns nothing, the model should not generate an answer. It should say so. This sounds obvious. It is not what most pipelines do by default.

I add a hard rule: if the top-k similarity scores are all below a threshold, short-circuit before the LLM call and return “I don’t have that in the docs I’ve been given.” Users prefer this. Stakeholders prefer it. The model never lied because it never spoke.
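
Here’s a minimal sketch of that rule in Python. The `retriever` and `llm` interfaces are placeholders for whatever you’re actually running, and the 0.35 cutoff is an assumption; calibrate it against your own corpus.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # cosine similarity, higher is better

NO_ANSWER = "I don't have that in the docs I've been given."
SCORE_THRESHOLD = 0.35  # assumed value; calibrate on held-out queries

def answer(query: str, retriever, llm) -> str:
    chunks: list[Chunk] = retriever.search(query, k=5)
    # Short-circuit BEFORE the LLM call: if nothing clears the bar,
    # the model never gets a chance to improvise.
    if not chunks or all(c.score < SCORE_THRESHOLD for c in chunks):
        return NO_ANSWER
    context = "\n\n".join(c.text for c in chunks)
    return llm.generate(
        f"Answer only from the context below.\n\n{context}\n\nQ: {query}"
    )
```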

2. Cite the chunks, not the documents

“Source: Employee Handbook” is not a citation. It’s a vibe. Real citations point to a specific chunk with enough surrounding context that a human can verify the claim in under 30 seconds. If your UI doesn’t make that easy, no one will check, and the system gets less trustworthy over time.
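
For illustration, here’s roughly the shape a chunk-level citation could take. The field names are mine, not any standard; the point is the offsets and surrounding context that make a claim checkable in seconds.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    doc_title: str       # "Employee Handbook" on its own is the vibe...
    section: str         # ...section plus offsets is the citation
    char_start: int      # character offsets into the source document
    char_end: int
    quote: str           # the exact text that supports the claim
    context_before: str  # a sentence or two on each side, for fast checking
    context_after: str

def render(c: Citation) -> str:
    # One line a reviewer can verify in under 30 seconds.
    return (
        f"{c.doc_title} > {c.section} (chars {c.char_start}-{c.char_end}):\n"
        f"  ...{c.context_before} [{c.quote}] {c.context_after}..."
    )
```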

3. Re-rank with a cheap model that disagrees

Vector similarity is a starting point, not the answer. A small re-ranking step, a cheap LLM call that asks “does this chunk actually answer the question?”, catches the cases where the embedding got it semantically close but topically wrong.

You’d be surprised how often the top vector hit is “this paragraph mentions all the same words but means the opposite thing.”
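
A sketch of what that re-rank pass might look like, reusing the `Chunk` shape from the first snippet. The yes/no prompt and the string parsing are deliberately crude, and the `llm.generate` call is a stand-in for whatever small model you have.

```python
def rerank(query: str, chunks: list, llm, keep: int = 3) -> list:
    """Keep only chunks a cheap yes/no model affirms, preserving score order."""
    kept = []
    for chunk in chunks:
        verdict = llm.generate(
            "Does the passage answer the question? Reply YES or NO only.\n"
            f"Question: {query}\n"
            f"Passage: {chunk.text}"
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append(chunk)
        if len(kept) == keep:
            break
    return kept
```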

4. Hold out a query set you didn’t build the system for

The most useful eval set is the one written by someone who didn’t tune the retriever. Get a domain expert to write 30 questions they’d actually ask, run them blind, and read every output. The questions you didn’t anticipate are where the system breaks.
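
Here’s roughly the harness I mean, with nothing automated by design. The `pipeline` callable is an assumption; it’s whatever end-to-end function returns the answer plus its citations.

```python
import json

def run_blind_eval(questions: list[str], pipeline,
                   out_path: str = "blind_eval.jsonl"):
    # Write one transcript per line; a human reads every single one.
    with open(out_path, "w") as f:
        for q in questions:
            result = pipeline(q)  # assumed: returns answer + citations as a dict
            f.write(json.dumps({"question": q, "result": result}) + "\n")
    print(f"Wrote {len(questions)} transcripts to {out_path}. Now read all of them.")
```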

This is not the eval that goes into a slide deck. It’s the eval that tells you whether to ship.

The anti-pattern I keep removing

“Just stuff more context into the prompt.” It feels safe. It’s not. Long contexts dilute the model’s attention to the relevant part, slow down responses, and make hallucinations harder to detect because the supporting text is buried somewhere in there.

Tighter retrieval beats bigger context windows almost every time.
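
If you want to enforce that in code, one option is a hard token budget on context assembly. This reuses the `Chunk` and `SCORE_THRESHOLD` from the first snippet; the 2000-token budget and the crude token counter are stand-ins.

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy; swap in your real tokenizer

def build_context(chunks: list, budget_tokens: int = 2000) -> str:
    picked, used = [], 0
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        if c.score < SCORE_THRESHOLD:
            break  # weak hits dilute attention; leave them out entirely
        cost = count_tokens(c.text)
        if used + cost > budget_tokens:
            break
        picked.append(c.text)
        used += cost
    return "\n\n".join(picked)
```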

What good looks like

A grounded answer, a verifiable citation, and a graceful “I don’t know” when the docs don’t cover it. That’s the bar. Everything fancier is a bonus.