S8 Sample · Eval harness


Eval harness: doc summarizer.

Dashboard view of a prompt v3 eval run on 40 golden examples for a document-summarization feature. Three failures, each with a prescribed follow-up action.

Fictional, representative data.

Run summary

Pass rate: 92.5% (37 of 40)
vs v2: +5.0% improvement
Model: claude-sonnet-4-6 (prompt v3)
Gate: PASS (threshold 85%)
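The run-level gate above is a simple pass-rate threshold. A minimal sketch of that check, with illustrative names (not this harness's actual code):

```python
# Run-level gate: the run passes if the example pass rate meets the threshold.
def gate(passed: int, total: int, threshold: float = 0.85) -> bool:
    """Return True when passed/total meets or exceeds the gate threshold."""
    return passed / total >= threshold

# This run: 37 of 40 examples passed -> 92.5%, above the 85% gate.
```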

Scorers

Scorer | Type | Weight | Purpose
contains_all_key_points | LLM-as-judge | 0.40 | Did summary include every expected key point?
no_hallucinated_claims | LLM-as-judge | 0.30 | Are all claims supported by source?
length_within_bounds | Regex | 0.15 | 50 to 250 words?
format_valid_json | Structured | 0.15 | Parses as expected schema?
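Per-example scores are a weighted sum of the four scorers above. A sketch of that aggregation, using the weights from the table; the function names and the strictly-greater-than pass rule (implied by the 0.85-weighted marginal fails below) are illustrative assumptions:

```python
# Scorer weights from the table above.
WEIGHTS = {
    "contains_all_key_points": 0.40,
    "no_hallucinated_claims": 0.30,
    "length_within_bounds": 0.15,
    "format_valid_json": 0.15,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-scorer results (each in [0, 1]) into one weighted score.

    Rounded to avoid float drift at the threshold boundary.
    """
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 4)

def passes(scores: dict[str, float], threshold: float = 0.85) -> bool:
    """Assumed rule: an example must exceed the threshold to pass."""
    return weighted_score(scores) > threshold

# e.g. 0.40*1.0 + 0.30*0.5 + 0.15*1.0 + 0.15*1.0 = 0.85 -> marginal fail.
```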

Pass rate over time

[Chart: pass rate from v0 through v3 in half-version steps, rising past the 85% gate; y-axis 70% to 100%]

Failures (3)

e_fail_01 - Board deck: 2026 operating plan

Scores: all_key_points 1.0 | no_hallucinations 0.5 | length 1.0 | json 1.0. Weighted 0.85, exactly at the 0.85 pass threshold: a marginal fail (a score must exceed the threshold to pass).

Failure: Output added “well-funded newcomer in the space” and “board approved with minor adjustments to marketing budget.” Neither appears in source.

Action: Prompt v4 will strengthen the no-hallucination instruction. Add an explicit “do not add context not present in the source” to the system prompt.

e_fail_02 - Architecture review: event sourcing proposal

Scores: all_key_points 0.25 | no_hallucinations 0.0 | length 0.6 | json 1.0. Weighted 0.34. Hard fail.

Failure: Source says “event sourcing for audit log only, not general state.” Summary says “system-wide adoption.” A factual inversion of the decision. Also missing the prototype spike and the concerns discussed.

Action: Add to “high-stakes inversion” test set. Consider a second-pass verification where the model is asked to check its summary against source.
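The second-pass verification idea can be sketched as a follow-up model call that audits each claim against the source. Everything here is a hypothetical stand-in: `call_model` represents whatever LLM client the harness uses, and the prompt wording is illustrative only.

```python
# Hypothetical second-pass verification: ask the model to audit its own
# summary against the source and flag unsupported or inverted claims.
VERIFY_PROMPT = """You wrote the summary below for the source document.
For each claim in the summary, answer SUPPORTED, UNSUPPORTED, or INVERTED
(INVERTED means the source states the opposite). One line per claim.

SOURCE:
{source}

SUMMARY:
{summary}
"""

def verify_summary(source: str, summary: str, call_model) -> bool:
    """Return True only if no claim is flagged UNSUPPORTED or INVERTED.

    `call_model` is a placeholder for the real LLM client: it takes a
    prompt string and returns the model's text response.
    """
    verdicts = call_model(VERIFY_PROMPT.format(source=source, summary=summary))
    return not any(
        flag in line
        for line in verdicts.splitlines()
        for flag in ("UNSUPPORTED", "INVERTED")
    )
```

A factual inversion like e_fail_02's “audit log only” vs. “system-wide adoption” is exactly the case this second pass is meant to catch.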

e_fail_03 - Pricing page redesign brief

Scores: all_key_points 1.0 | no_hallucinations 1.0 | length 1.0 | json 0.0. Weighted 0.85. Marginal fail.

Failure: Summary content is correct. JSON is malformed (unclosed brace).

Action: Not a prompt bug - an infrastructure bug. Add JSON validation + one-shot retry on parse failure in the production API client. Update eval scorer to separate “content is right” from “format parses.”
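A minimal sketch of the proposed client-side fix: validate the JSON and retry once on a parse failure. `generate` is a hypothetical wrapper around the production API call; the retry-on-parse-error shape is the point, not any particular client library.

```python
import json

def summarize_with_retry(document: str, generate) -> dict:
    """Call the model, parse its JSON output, and retry once if parsing fails.

    `generate` is a placeholder: it takes the document and returns the
    model's raw text output.
    """
    last_error = None
    for _attempt in range(2):  # one initial call plus one retry
        raw = generate(document)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err  # e.g. the unclosed brace seen in e_fail_03
    raise ValueError(f"model returned malformed JSON twice: {last_error}")
```

This keeps the fix in the API client, matching the takeaway that e_fail_03 is an infrastructure bug rather than a prompt bug.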

Takeaways for this run

  1. v3 is a shippable improvement over v2 (+5.0%). Approve for prod rollout.
  2. Two of three failures are real prompt issues (hallucination bleed). v4 should target these.
  3. One failure is infrastructure, not prompt. Fix in client code, not prompt.
  4. Hard fail (e_fail_02) is a category we care about. Add 5 more inversion-risk examples to the golden set.
Related guide · Medium · 28 min

Evals: a primer

The methodology behind this harness.

Related service · Engagement

Product development

We build eval harnesses as part of every AI feature we ship.