Evals · public scorecard

I grade my own assistant.

My post Evals as Product Feedback argues for small, durable eval sets you actually read. This page holds Ask my notes to that standard: 20 hand-written cases, run against the exact pipeline that serves visitors, results committed to the repo. Failures stay on the board — evals are feedback, not gates.

20/20Refusal correctness100%
14/14Retrieval accuracy100%

Last run 2026-07-03 · mode: retrieval-only · no model call — retrieval & gate only

Retrieval-only runs grade the grounding layer (the gate and the ranking) without spending tokens; full runs additionally grade generated answers.

All 20 cases

20 passed
  • How do I stop an agent loop from sprawling?

    in-scope · top: agent-loops-that-dont-sprawl (10.55)

  • What should I log every turn to debug an agent?

    in-scope · top: agent-loops-that-dont-sprawl (12.33)

  • What queries should I run to audit CMDB health?

    in-scope · top: cmdb-health-honest-audit-playbook (13.12)

  • Who should I interview during a CMDB audit?

    in-scope · top: cmdb-health-honest-audit-playbook (7.31)

  • When should a RAG system refuse to answer?

    in-scope · top: rag-that-doesnt-lie (5.74)

  • Why should citations point to chunks instead of documents?

    in-scope · top: rag-that-doesnt-lie (14.8)

  • Why is stuffing more context into the prompt an anti-pattern?

    in-scope · top: rag-that-doesnt-lie (18.73)

  • How many examples should an eval set start with?

    in-scope · top: evals-as-product-feedback (12.71)

  • Should evals block releases when they regress?

    in-scope · top: evals-as-product-feedback (7.35)

  • What should I extract first when splitting a sprawling custom app?

    in-scope · top: decomposing-custom-apps-without-breaking-prod (12.84)

  • How do I migrate tables to a new scope without breaking production?

    in-scope · top: decomposing-custom-apps-without-breaking-prod (15.21)

  • How should I name Flow Designer flows so they survive refactors?

    in-scope · top: flow-designer-patterns-that-survive-refactors (16.72)

  • Which Now Assist capabilities earned user trust fastest?

    in-scope · top: now-assist-rollout-notes (8.77)

  • What is a neuron in a neural network, actually?

    in-scope · top: neural-networks-101 (10.6)

  • What's your favorite pizza topping?

    out-of-scope · refused

  • How do I configure a Kubernetes ingress controller?

    out-of-scope · refused

  • What's the best hotel in Paris?

    out-of-scope · refused

  • Write me a poem about the ocean.

    out-of-scope · refused

  • What is the stock price of ServiceNow today?

    out-of-scope · refused

    Adversarial: lexically close to in-scope content. Lexical retrieval may pass the gate here — kept deliberately as an honest hard case.

  • How should I train for a marathon?

    out-of-scope · refused

Methodology

Refusal correctness: Out-of-scope questions must be refused before any model call; in-scope questions must pass the gate. Retrieval accuracy: For in-scope questions, the expected post must be the #1 retrieved section. Groundedness: Generated answers must carry at least one citation to a retrieved section. Citation accuracy: At least one citation must point at the post the question is actually about. The set includes deliberately adversarial cases (lexically similar, semantically out-of-scope) and they stay in even when they fail — a failing eval you can see is worth more than a green dashboard you can't trust.