Every AI team I’ve worked with hits the same wall around month two. The demos are good. Someone asks “is it actually better than last week?” and the room goes quiet.
Here’s how I get teams unstuck without building yet another dashboard nobody reads.
Start with 30 examples, not 3,000
A small, hand-curated eval set you’ll actually look at beats a giant auto-generated one you won’t. Thirty examples that span the real distribution of user requests, including the weird ones, are enough to detect regressions for months.
The bar isn’t “statistically significant”. The bar is “I trust this set to flag a problem before a user does”.
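For concreteness, here’s roughly what that looks like as a checked-in artifact. This is a minimal sketch, assuming the set lives in a JSONL file with one example per line; the file path, field names, and tag scheme are all illustrative, not something your team has to adopt.

```python
# golden_set.py -- the hand-curated eval set as a plain, reviewable file.
# Assumptions: the set lives in eval/golden.jsonl, one JSON object per line,
# with illustrative fields like:
#   {"id": "long-thread-07", "input": "...", "tags": ["long", "multi-speaker"]}
import json
from collections import Counter
from pathlib import Path

GOLDEN_PATH = Path("eval/golden.jsonl")  # hypothetical location

def load_golden_set(path: Path = GOLDEN_PATH) -> list[dict]:
    """Load the curated examples; fail loudly if the file grows past readable."""
    examples = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    assert len(examples) <= 50, "Keep it small enough that someone actually reads it."
    return examples

def coverage_report(examples: list[dict]) -> Counter:
    """Count examples per tag so you notice when the weird ones quietly fall out."""
    return Counter(tag for ex in examples for tag in ex.get("tags", []))

if __name__ == "__main__":
    examples = load_golden_set()
    for tag, count in coverage_report(examples).most_common():
        print(f"{tag}: {count}")
```

The coverage report is social, not statistical: it makes “we quietly dropped all the long-thread examples” visible in review.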
Score things humans care about, not things models can score
It’s tempting to use BLEU, ROUGE, or an LLM-as-judge for everything. Sometimes that’s right. Often it’s a way of measuring what’s easy instead of what matters.
For a customer-facing summarization feature, the metrics I actually care about are:
- Faithfulness: does the summary contradict the source? (LLM-judged, sampled, then human-reviewed.)
- Useful length: is it short enough that a stakeholder would actually paste it into Slack?
- Did the user accept it? (Acceptance rate from the product itself.)
The third one is the only metric that closes the loop with the user.
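To make those three concrete, here’s a rough sketch of how a single eval run might compute them. `judge_faithfulness` stands in for whichever LLM-as-judge call you already have (sampled, then human-reviewed, per the first bullet); the product-event shape and the 120-word “Slack-pasteable” budget are assumptions for illustration.

```python
# summarization_eval.py -- the three numbers from the list above, for one run.
# Assumptions: judge_faithfulness is a stand-in for your LLM-as-judge call;
# the product-event shape and MAX_WORDS budget are made up for illustration.
from dataclasses import dataclass
from typing import Callable

MAX_WORDS = 120  # assumed budget for "short enough to paste into Slack"

@dataclass
class EvalResult:
    faithfulness_rate: float   # share of summaries the judge did not flag
    within_length_rate: float  # share under the length budget
    acceptance_rate: float     # share the user actually accepted, from the product

def run_eval(
    pairs: list[tuple[str, str]],                    # (source, summary) pairs
    product_events: list[dict],                      # e.g. {"summary_id": "...", "accepted": True}
    judge_faithfulness: Callable[[str, str], bool],  # True if no contradiction found
) -> EvalResult:
    faithful = [judge_faithfulness(src, summ) for src, summ in pairs]
    short_enough = [len(summ.split()) <= MAX_WORDS for _, summ in pairs]
    accepted = [bool(e.get("accepted")) for e in product_events]
    return EvalResult(
        faithfulness_rate=sum(faithful) / max(len(faithful), 1),
        within_length_rate=sum(short_enough) / max(len(short_enough), 1),
        acceptance_rate=sum(accepted) / max(len(accepted), 1),
    )
```

Nothing clever here; the value is that all three land in the same report instead of living in three different tools.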
Treat evals as product feedback, not gating
If your eval has to be green before you ship, it’ll get gamed. Engineers will tune the model until the eval is green and stop thinking about whether the product is better.
I run evals as feedback, not gates. Regressions get a conversation, not an automatic block. The team decides whether the regression is real, expected, or an artifact of the eval set itself. That conversation is the value.
The cadence that worked for me
- Pre-merge: lightweight eval on a tiny golden set (5-10 examples). Runs in CI, takes under a minute (sketched below).
- Daily: the full 30-example set runs against main, results posted in a channel.
- Weekly: human spot-check of 5 randomly sampled real user interactions.
The weekly human spot-check is the one I keep cutting and re-adding, and the team always notices when it’s gone.
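For the pre-merge step, the lightest version is an ordinary test over the tiny golden set. Everything specific here (the `summarize` import, the file path, the cheap checks) is a placeholder for whatever your product actually ships.

```python
# test_golden_premerge.py -- the pre-merge check, as an ordinary pytest test.
# Assumptions: `summarize` is your entry point (hypothetical import), the 5-10
# example subset lives in eval/golden_tiny.jsonl, and the cheap checks below
# stand in for whatever "obviously broken" means for your product.
import json
from pathlib import Path

from my_app import summarize  # hypothetical: the function under test

GOLDEN_TINY = Path("eval/golden_tiny.jsonl")

def test_golden_tiny_smoke():
    examples = [json.loads(line) for line in GOLDEN_TINY.read_text().splitlines() if line.strip()]
    failures = []
    for ex in examples:
        summary = summarize(ex["input"])
        # Only cheap, fast checks pre-merge: non-empty output, within the length budget.
        if not summary.strip() or len(summary.split()) > 120:
            failures.append(ex["id"])
    # A failure here starts a conversation in the PR, not an automatic revert.
    assert not failures, f"Golden-set examples worth a look: {failures}"
```

The daily 30-example run and the weekly spot-check stay outside CI on purpose; this test only exists to catch obvious breakage before it merges.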
What I’d tell my past self
- The eval set is a product artifact, not an engineering one. Curate it like you’d curate a roadmap.
- Throwing more examples at a bad eval set doesn’t make it good. It makes it bigger.
- “It feels better” is a real signal. Don’t dismiss it. Just don’t only use it.
The goal isn’t perfect measurement. It’s enough measurement that the team can argue from data instead of taste.