Evals: a primer
The methodology behind this harness.
Dashboard view of a prompt v3 eval run on 40 golden examples for a document-summarization feature. Three failures, each with a prescribed follow-up action.
The data is fictional but representative.
| Scorer | Type | Weight | Purpose |
|---|---|---|---|
| contains_all_key_points | LLM-as-judge | 0.40 | Did summary include every expected key point? |
| no_hallucinated_claims | LLM-as-judge | 0.30 | Are all claims supported by source? |
| length_within_bounds | Regex | 0.15 | 50 to 250 words? |
| format_valid_json | Structured | 0.15 | Parses as expected schema? |
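Each scorer returns a value in [0, 1], and the run combines them into a single weighted score checked against a pass threshold. Below is a minimal sketch of that combination step, not this harness's actual code: the scorer names, weights, and 0.85 threshold come from the table, while the `combine` helper and the strict greater-than comparison (inferred from the marginal fails at exactly 0.85) are assumptions.

```python
# Sketch of weighted composite scoring. Names, weights, and the 0.85
# threshold mirror the rubric table; everything else is illustrative.
PASS_THRESHOLD = 0.85

WEIGHTS = {
    "contains_all_key_points": 0.40,
    "no_hallucinated_claims": 0.30,
    "length_within_bounds": 0.15,
    "format_valid_json": 0.15,
}

def combine(scores: dict[str, float]) -> tuple[float, bool]:
    """Weight each per-scorer score in [0, 1] and compare to the pass threshold."""
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Treating the threshold as strict, since a weighted 0.85 is reported as a marginal fail.
    return round(weighted, 2), weighted > PASS_THRESHOLD

# Example 1 below: 0.40*1.0 + 0.30*0.5 + 0.15*1.0 + 0.15*1.0 = 0.85 -> marginal fail.
weighted, passed = combine({
    "contains_all_key_points": 1.0,
    "no_hallucinated_claims": 0.5,
    "length_within_bounds": 1.0,
    "format_valid_json": 1.0,
})
```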
Example 1 scores: all_key_points 1.0 | no_hallucinations 0.5 | length 1.0 | json 1.0. Weighted 0.85 against a pass threshold of 0.85 - marginal fail.
Failure: The output added “well-funded newcomer in the space” and “board approved with minor adjustments to marketing budget.” Neither claim appears in the source.
Action: Prompt v4 will strengthen the no-hallucination instruction by adding an explicit “do not add context not present in source” rule to the system prompt.
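A hypothetical sketch of what that v3-to-v4 change could look like. The placeholder rules stand in for the real prompt (which lives in the versioned prompt file linked at the end of this post); only the added grounding rule is taken from the action item.

```python
# Placeholder prompt rules; only the final rule reflects the v4 action item.
SUMMARY_RULES_V3 = [
    "Summarize the source document into the required JSON schema.",
    "Keep the summary between 50 and 250 words.",
]

SUMMARY_RULES_V4 = SUMMARY_RULES_V3 + [
    # New in v4: explicit no-hallucination constraint.
    "Do not add context, claims, or framing that is not present in the source.",
]

SYSTEM_PROMPT_V4 = "\n".join(SUMMARY_RULES_V4)
```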
Example 2 scores: all_key_points 0.25 | no_hallucinations 0.0 | length 0.6 | json 1.0. Weighted 0.34 - hard fail.
Failure: The source says “event sourcing for audit log only, not general state,” but the summary says “system-wide adoption” - a factual inversion of the decision. The summary also omits the prototype spike and the concerns discussed.
Action: Add this example to the “high-stakes inversion” test set, and consider a second-pass verification step in which the model is asked to check its summary against the source.
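A sketch of what that second pass could look like. The client, model name, and judge prompt here are placeholders, not part of this harness; only the idea of asking a model to check the draft summary against the source comes from the action item.

```python
import json
from openai import OpenAI  # placeholder client; any chat-completion API works here

client = OpenAI()

VERIFY_PROMPT = """\
You are checking a draft summary against its source document.
List every claim in the summary that the source does not support,
and every decision whose meaning the summary inverts or changes.
Respond as JSON: {"unsupported_claims": [...], "inverted_decisions": [...]}.
"""

def verify_summary(source: str, summary: str) -> dict:
    """Second pass: have a model audit the draft summary against the source."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": VERIFY_PROMPT},
            {"role": "user", "content": f"SOURCE:\n{source}\n\nSUMMARY:\n{summary}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

def accept(source: str, summary: str) -> bool:
    """Only accept the summary if the second pass finds nothing to flag."""
    report = verify_summary(source, summary)
    return not report["unsupported_claims"] and not report["inverted_decisions"]
```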
Example 3 scores: all_key_points 1.0 | no_hallucinations 1.0 | length 1.0 | json 0.0. Weighted 0.85 - marginal fail.
Failure: The summary content is correct, but the JSON is malformed (unclosed brace).
Action: Not a prompt bug - an infrastructure bug. Add JSON validation plus a one-shot retry on parse failure in the production API client, and update the eval scorer to separate “content is right” from “format parses.”
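A minimal sketch of that validate-and-retry behavior in the client, with `call_model` standing in for whatever call produces the raw summary; only the parse check and the one-shot retry come from the action item.

```python
import json
from typing import Callable

def get_summary(source: str, call_model: Callable[[str], str]) -> dict:
    """Call the model, validate the JSON, and retry exactly once on a parse failure."""
    last_error = None
    for attempt in range(2):  # initial attempt + one retry
        raw = call_model(source)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:  # e.g. an unclosed brace
            last_error = err
    raise ValueError("Model returned malformed JSON twice") from last_error
```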
Related reading: the methodology behind this harness, and the prompts under test here, versioned and commented.
We build eval harnesses as part of every AI feature we ship.