S8 Sample · Eval harness


Eval harness: doc summarizer.

Dashboard view of a prompt v3 eval run on 40 golden examples for a document-summarization feature. Three failures, each with a prescribed follow-up action.

Fictional, representative data.

Run summary

Pass rate: 92.5% (37 of 40)
vs v2: +5.0% improvement
Model: claude-sonnet-4-6 (prompt v3)
Gate: PASS (threshold 85%)
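The run-level gate above is a simple pass-rate threshold. A minimal sketch of that check, with illustrative names (not this harness's actual code):

```python
# Run-level gate: the run passes if the example pass rate meets the threshold.
def gate(passed: int, total: int, threshold: float = 0.85) -> bool:
    """Return True when passed/total meets or exceeds the gate threshold."""
    return passed / total >= threshold

# This run: 37 of 40 examples passed -> 92.5%, above the 85% gate.
```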

Scorers

Scorer | Type | Weight | Purpose
contains_all_key_points | LLM-as-judge | 0.40 | Did summary include every expected key point?
no_hallucinated_claims | LLM-as-judge | 0.30 | Are all claims supported by source?
length_within_bounds | Regex | 0.15 | 50 to 250 words?
format_valid_json | Structured | 0.15 | Parses as expected schema?
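Per-example scores are a weighted sum of the four scorers above. A sketch of that aggregation, using the weights from the table; the function names and the strictly-greater-than pass rule (implied by the 0.85-weighted marginal fails below) are illustrative assumptions:

```python
# Scorer weights from the table above.
WEIGHTS = {
    "contains_all_key_points": 0.40,
    "no_hallucinated_claims": 0.30,
    "length_within_bounds": 0.15,
    "format_valid_json": 0.15,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-scorer results (each in [0, 1]) into one weighted score.

    Rounded to avoid float drift at the threshold boundary.
    """
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 4)

def passes(scores: dict[str, float], threshold: float = 0.85) -> bool:
    """Assumed rule: an example must exceed the threshold to pass."""
    return weighted_score(scores) > threshold

# e.g. 0.40*1.0 + 0.30*0.5 + 0.15*1.0 + 0.15*1.0 = 0.85 -> marginal fail.
```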

Pass rate over time

[Chart: pass rate from v0 through v3 in half-version steps, rising past the 85% gate; y-axis 70% to 100%]

Failures (3)

e_fail_01 - Board deck: 2026 operating plan

Scores: all_key_points 1.0 | no_hallucinations 0.5 | length 1.0 | json 1.0. Weighted 0.85, exactly at the 0.85 pass threshold: a marginal fail (a score must exceed the threshold to pass).

Failure: Output added “well-funded newcomer in the space” and “board approved with minor adjustments to marketing budget.” Neither appears in source.

Action: Prompt v4 will strengthen the no-hallucination instruction. Add an explicit “do not add context not present in the source” to the system prompt.

e_fail_02 - Architecture review: event sourcing proposal

Scores: all_key_points 0.25 | no_hallucinations 0.0 | length 0.6 | json 1.0. Weighted 0.34. Hard fail.

Failure: Source says “event sourcing for audit log only, not general state.” Summary says “system-wide adoption.” A factual inversion of the decision. Also missing the prototype spike and the concerns discussed.

Action: Add to “high-stakes inversion” test set. Consider a second-pass verification where the model is asked to check its summary against source.
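The second-pass verification idea can be sketched as a follow-up model call that audits each claim against the source. Everything here is a hypothetical stand-in: `call_model` represents whatever LLM client the harness uses, and the prompt wording is illustrative only.

```python
# Hypothetical second-pass verification: ask the model to audit its own
# summary against the source and flag unsupported or inverted claims.
VERIFY_PROMPT = """You wrote the summary below for the source document.
For each claim in the summary, answer SUPPORTED, UNSUPPORTED, or INVERTED
(INVERTED means the source states the opposite). One line per claim.

SOURCE:
{source}

SUMMARY:
{summary}
"""

def verify_summary(source: str, summary: str, call_model) -> bool:
    """Return True only if no claim is flagged UNSUPPORTED or INVERTED.

    `call_model` is a placeholder for the real LLM client: it takes a
    prompt string and returns the model's text response.
    """
    verdicts = call_model(VERIFY_PROMPT.format(source=source, summary=summary))
    return not any(
        flag in line
        for line in verdicts.splitlines()
        for flag in ("UNSUPPORTED", "INVERTED")
    )
```

A factual inversion like e_fail_02's “audit log only” vs. “system-wide adoption” is exactly the case this second pass is meant to catch.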

e_fail_03 - Pricing page redesign brief

Scores: all_key_points 1.0 | no_hallucinations 1.0 | length 1.0 | json 0.0. Weighted 0.85. Marginal fail.

Failure: Summary content is correct. JSON is malformed (unclosed brace).

Action: Not a prompt bug - an infrastructure bug. Add JSON validation + one-shot retry on parse failure in the production API client. Update eval scorer to separate “content is right” from “format parses.”
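A minimal sketch of the proposed client-side fix: validate the JSON and retry once on a parse failure. `generate` is a hypothetical wrapper around the production API call; the retry-on-parse-error shape is the point, not any particular client library.

```python
import json

def summarize_with_retry(document: str, generate) -> dict:
    """Call the model, parse its JSON output, and retry once if parsing fails.

    `generate` is a placeholder: it takes the document and returns the
    model's raw text output.
    """
    last_error = None
    for _attempt in range(2):  # one initial call plus one retry
        raw = generate(document)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err  # e.g. the unclosed brace seen in e_fail_03
    raise ValueError(f"model returned malformed JSON twice: {last_error}")
```

This keeps the fix in the API client, matching the takeaway that e_fail_03 is an infrastructure bug rather than a prompt bug.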

Takeaways for this run

  1. v3 is a shippable improvement over v2 (+5.0%). Approve for prod rollout.
  2. Two of three failures are real prompt issues (hallucination bleed). v4 should target these.
  3. One failure is infrastructure, not prompt. Fix in client code, not prompt.
  4. Hard fail (e_fail_02) is a category we care about. Add 5 more inversion-risk examples to the golden set.
Related guide · Medium · 28 min

Evals: a primer

The methodology behind this harness.

Related service · Engagement

Product development

We build eval harnesses as part of every AI feature we ship.