What an eval is
An eval is an automated test that measures how well your LLM-backed feature performs on a set of inputs. It answers two questions:
- Does the feature still work after the last change?
- By how much did it improve (or regress)?
An eval is not a benchmark. Benchmarks measure model capability on generic tasks. Evals measure your feature against your users' real needs.
The three building blocks
- Golden dataset: a curated, version-controlled set of input-output pairs.
- Scorer: a function that, given an expected output and an actual output, returns a score.
- Runner: the script that ties them together and reports a pass rate.
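In code, the three pieces stay small. A minimal sketch in Python (the names GoldenExample, Scorer, and run_eval are illustrative, not a prescribed API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenExample:
    input: str       # what the feature receives
    expected: str    # the human-written golden output
    rationale: str   # why this is the expected output

# A scorer maps (expected, actual) to a score in [0, 1].
Scorer = Callable[[str, str], float]

def run_eval(dataset: list[GoldenExample], scorer: Scorer,
             generate: Callable[[str], str]) -> float:
    """Run the feature over the dataset and return the mean score."""
    scores = [scorer(ex.expected, generate(ex.input)) for ex in dataset]
    return sum(scores) / len(scores)
```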
Step 1: Build the golden dataset
Starting from zero:
- Collect 20 real inputs from your feature's actual usage (with user consent / appropriate redaction).
- For each input, ask a human (you or a domain expert) to produce the expected output. This is the golden output.
- For each example, also record a rationale: why is this the expected output?
- Mix deliberately: 10 typical cases, 5 edge cases, 5 adversarial / red-team inputs.
20 is the floor, not the ceiling. Aim for 100 within a month of launch. The dataset grows with the feature.
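One entry might look like this, shown as a Python dict (the on-disk format could equally be JSON or JSONL; the field names and content are invented for illustration):

```python
# One golden-dataset entry. "category" records the deliberate mix of
# typical / edge_case / adversarial inputs described above.
example = {
    "id": "ticket-refund-007",
    "category": "edge_case",
    "input": "Customer demands a refund for an order placed 14 months ago...",
    "expected": "Customer requests a refund outside the 12-month window. "
                "Policy does not allow it; agent should offer a goodwill credit.",
    "rationale": "Tests handling of requests that fall outside policy limits.",
}
```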
Step 2: Pick a scorer
Scorers, from most to least rigorous:
- Exact match. The output must equal the golden exactly. Works for classification, multiple choice, structured data extraction with a known schema.
- Regex / substring match. The output must contain certain substrings, or match a regex. Works for “did it include the required disclaimer?” checks.
- Structured equality. For JSON outputs, compare field by field with type coercion. Tolerate reordering.
- Semantic similarity. Embed both outputs, compare cosine similarity. Works for “is this close enough?” but noisy.
- LLM-as-judge. Ask a separate Claude invocation to rate the output against the golden and a rubric. Flexible, but expensive and needs calibration.
- Human review. The scorer of last resort and the calibration source for all the others.
Most real features need 2-3 scorers running in parallel. Structural checks (“is it valid JSON?”) plus a semantic check plus an LLM-as-judge rubric score is a common blend.
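Minimal sketches of the first three scorer types, assuming each scorer returns a score between 0 and 1; the disclaimer regex is just an example pattern:

```python
import json
import re

def exact_match(expected: str, actual: str) -> float:
    return float(expected.strip() == actual.strip())

def has_disclaimer(expected: str, actual: str) -> float:
    # Substring / regex check: did the output include the required disclaimer?
    return float(re.search(r"not (financial|legal) advice", actual, re.IGNORECASE) is not None)

def json_field_equality(expected: str, actual: str) -> float:
    # Structured equality: field-by-field comparison, tolerant of key order,
    # with crude type coercion via str().
    try:
        exp, act = json.loads(expected), json.loads(actual)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(exp, dict) or not isinstance(act, dict):
        return float(exp == act)
    if not exp:
        return 1.0
    matches = sum(str(exp[k]) == str(act.get(k)) for k in exp)
    return matches / len(exp)
```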
Step 3: LLM-as-judge, done right
LLM-as-judge can work, but it fails quietly. To use it well:
- Use a model from a different family, or at least a different configuration, than the one producing the output. Calibrate it against human scores on a sample before trusting it.
- Write an explicit rubric. “Score 1-5 where 5 = factually correct, cites sources, no hallucinations. 1 = factually wrong or refuses the task.”
- Have the judge explain its score. The explanation is more useful than the score.
- Calibrate by sampling. Every week, take 20 judge scores and have a human re-score. If the correlation drops below 0.8, recalibrate.
- Avoid self-grading. Do not ask the same model to both produce the output and judge it. If you must, use a different prompt structure and expect bias toward its own outputs.
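A sketch of a judge call. It assumes you supply a call_judge function that sends a prompt to a judge model from a different family than the model under test and returns its text response; the rubric and the JSON response format are illustrative:

```python
import json
from typing import Callable

JUDGE_RUBRIC = """You are grading a model output against a golden reference.
Score 1-5 where 5 = factually correct, covers the same points as the reference,
no hallucinations; 1 = factually wrong or refuses the task.
Respond with JSON: {"score": <1-5>, "explanation": "<one or two sentences>"}"""

def llm_judge(expected: str, actual: str, call_judge: Callable[[str], str]) -> dict:
    """Ask a separate judge model for a score and keep the explanation, not just the number."""
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Golden reference:\n{expected}\n\n"
        f"Model output:\n{actual}"
    )
    raw = call_judge(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judges fail quietly; surface unparseable responses instead of guessing a score.
        return {"score": None, "explanation": f"unparseable judge response: {raw[:100]}"}
```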
Step 4: Run on every prompt change
Hook the eval runner into CI:
- On every PR that touches a prompt, a system message, or the model choice, run the full eval.
- Require the pass rate to be at or above the baseline; otherwise the PR is blocked (see the gate sketch after this list).
- Track pass rate over time in a dashboard. When it trends down, investigate.
- On production regressions (user complaints, bug reports), add the failing input to the golden dataset so this specific failure is caught in the future.
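A minimal gate script for CI, assuming the eval runner writes its pass rate to a JSON file and the baseline is checked into the repo; the file paths and field names are placeholders:

```python
#!/usr/bin/env python3
# Compare the latest eval run against the stored baseline and exit nonzero
# so the PR check fails on regression.
import json
import sys

def main() -> None:
    with open("eval/baseline.json") as f:       # e.g. {"pass_rate": 0.92}
        baseline = json.load(f)
    with open("eval/latest_run.json") as f:     # written by the eval runner
        current = json.load(f)
    if current["pass_rate"] < baseline["pass_rate"]:
        print(f"Regression: pass rate {current['pass_rate']:.1%} "
              f"is below baseline {baseline['pass_rate']:.1%}")
        sys.exit(1)
    print(f"OK: pass rate {current['pass_rate']:.1%} meets baseline {baseline['pass_rate']:.1%}")

if __name__ == "__main__":
    main()
```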
Step 5: Regression detection
Evals catch regressions within your development loop. To catch regressions in production:
- Sample and score. Score every Nth production request with your eval scorers; alert if the scores trend down (see the sketch after this list).
- User feedback loop. Wire thumbs up / thumbs down into the UI. Monitor thumbs-down rate; investigate when it spikes.
- Drift detection. If the distribution of inputs changes (new customer, new use case), your golden dataset may no longer be representative. Re-sample quarterly.
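A sketch of the sample-and-score hook, with invented sampling rate and alert threshold. Production requests have no golden output, so the scorer here must be reference-free (structural validity, rubric-only judge):

```python
import random
from typing import Callable

SAMPLE_RATE = 0.02       # score roughly 1 in 50 production requests
ALERT_BELOW = 0.75       # alert when the rolling mean drops below this
WINDOW = 200             # number of recent scores to average over

recent_scores: list[float] = []

def maybe_score(request: str, response: str,
                reference_free_scorer: Callable[[str, str], float],
                alert: Callable[[str], None]) -> None:
    """Hook on the production request path: sample, score, alert on a downward trend."""
    if random.random() > SAMPLE_RATE:
        return
    recent_scores.append(reference_free_scorer(request, response))
    window = recent_scores[-WINDOW:]
    mean = sum(window) / len(window)
    if len(window) >= 50 and mean < ALERT_BELOW:
        alert(f"Eval score rolling mean dropped to {mean:.2f}")
```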
A minimal example
For a feature that summarizes support tickets:
- Dataset: 40 tickets, each with a 3-sentence gold summary written by a support engineer.
- Scorers:
- Structural: output is 1-3 sentences, under 400 characters.
- Factual: LLM-as-judge rubric scoring factuality and completeness, 1-5 scale, gold as reference.
- Tone: LLM-as-judge rubric scoring tone consistency against a style guide.
- Baseline: factual average 4.1, tone average 4.4, structural pass 97%.
- Regression gate: any score more than 0.3 below baseline blocks the PR.
A sample eval harness in this shape is at samples/eval-harness-example, with a structured dataset in JSON.
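A sketch of that regression gate in code, using the baseline numbers above. The uniform 0.3 drop is applied to all three scorers for simplicity; the structural pass rate is on a 0-1 scale, so in practice it may deserve its own threshold:

```python
# Regression gate for the ticket-summary example: any average more than
# 0.3 below its baseline blocks the PR.
BASELINE = {"factual": 4.1, "tone": 4.4, "structural": 0.97}
MAX_DROP = 0.3

def regressions(current: dict[str, float]) -> list[str]:
    """Return the scorers whose current average fell more than MAX_DROP below baseline."""
    return [name for name, base in BASELINE.items()
            if current.get(name, 0.0) < base - MAX_DROP]

failed = regressions({"factual": 3.7, "tone": 4.5, "structural": 0.96})
if failed:
    raise SystemExit(f"Blocked: regression in {', '.join(failed)}")
```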
Common mistakes
- Too few examples. 5 examples tell you nothing about variance.
- No adversarial cases. Your eval looks good and production looks terrible.
- Scorer that never fails. If your eval passes 100% on every run, your scorer is not discriminating.
- Judge model same family as production model. Same-family judges systematically prefer outputs that look like themselves.
- No regression gate. Evals that exist but do not block PRs become vestigial.
- Never growing the dataset. If the dataset is frozen at launch, it stops catching things a month later.