Evals · public scorecard

I grade my own assistant.

My post Evals as Product Feedback argues for small, durable eval sets you actually read. This page holds Ask my notes to that standard: 20 hand-written cases, run against the exact pipeline that serves visitors, results committed to the repo. Failures stay on the board — evals are feedback, not gates.

20/20Refusal correctness100%

14/14Retrieval accuracy100%

Last run 2026-07-03 · mode: retrieval-only · no model call — retrieval & gate only

Retrieval-only runs grade the grounding layer (the gate and the ranking) without spending tokens; full runs additionally grade generated answers.

All 20 cases

20 passed

How do I stop an agent loop from sprawling?
in-scope · top: agent-loops-that-dont-sprawl (10.55)
What should I log every turn to debug an agent?
in-scope · top: agent-loops-that-dont-sprawl (12.33)
What queries should I run to audit CMDB health?
in-scope · top: cmdb-health-honest-audit-playbook (13.12)
Who should I interview during a CMDB audit?
in-scope · top: cmdb-health-honest-audit-playbook (7.31)
When should a RAG system refuse to answer?
in-scope · top: rag-that-doesnt-lie (5.74)
Why should citations point to chunks instead of documents?
in-scope · top: rag-that-doesnt-lie (14.8)
Why is stuffing more context into the prompt an anti-pattern?
in-scope · top: rag-that-doesnt-lie (18.73)
How many examples should an eval set start with?
in-scope · top: evals-as-product-feedback (12.71)
Should evals block releases when they regress?
in-scope · top: evals-as-product-feedback (7.35)
What should I extract first when splitting a sprawling custom app?
in-scope · top: decomposing-custom-apps-without-breaking-prod (12.84)
How do I migrate tables to a new scope without breaking production?
in-scope · top: decomposing-custom-apps-without-breaking-prod (15.21)
How should I name Flow Designer flows so they survive refactors?
in-scope · top: flow-designer-patterns-that-survive-refactors (16.72)
Which Now Assist capabilities earned user trust fastest?
in-scope · top: now-assist-rollout-notes (8.77)
What is a neuron in a neural network, actually?
in-scope · top: neural-networks-101 (10.6)
What's your favorite pizza topping?
out-of-scope · refused
How do I configure a Kubernetes ingress controller?
out-of-scope · refused
What's the best hotel in Paris?
out-of-scope · refused
Write me a poem about the ocean.
out-of-scope · refused
What is the stock price of ServiceNow today?
out-of-scope · refused
Adversarial: lexically close to in-scope content. Lexical retrieval may pass the gate here — kept deliberately as an honest hard case.
How should I train for a marathon?
out-of-scope · refused

Methodology

Refusal correctness: Out-of-scope questions must be refused before any model call; in-scope questions must pass the gate. Retrieval accuracy: For in-scope questions, the expected post must be the #1 retrieved section. Groundedness: Generated answers must carry at least one citation to a retrieved section. Citation accuracy: At least one citation must point at the post the question is actually about. The set includes deliberately adversarial cases (lexically similar, semantically out-of-scope) and they stay in even when they fail — a failing eval you can see is worth more than a green dashboard you can't trust.