G5.2 Guide · Product development


Evals: a primer.

Your LLM-backed feature is unmeasured until you have an eval. This is how to build one from nothing.

Length: 28 min · Audience: eng lead / PM / operator · Last updated: 2026-04-19

What an eval is

An eval is an automated test that measures how well your LLM-backed feature performs on a set of inputs. It answers two questions:

  1. Does the feature still work after the last change?
  2. By how much did it improve (or regress)?

An eval is not a benchmark. Benchmarks measure model capability on generic tasks. Evals measure your feature against your users' real needs.

The three building blocks

  1. Golden dataset: a versioned set of input-output pairs, frozen for any single run but grown deliberately over time.
  2. Scorer: a function that, given an expected output and an actual output, returns a score.
  3. Runner: the script that ties them together and reports a pass rate.
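
A minimal runner, as a sketch in Python: it assumes a JSONL golden dataset with "input" and "expected" fields and a call_feature function that invokes your LLM-backed feature (both names are placeholders, not a prescribed layout).

    import json

    def exact_match(expected: str, actual: str) -> float:
        # Simplest scorer: 1.0 if the output equals the golden exactly, else 0.0.
        return 1.0 if expected.strip() == actual.strip() else 0.0

    def run_eval(dataset_path: str, generate, scorer=exact_match) -> float:
        # Runner: load the golden dataset, score every case, report a pass rate.
        with open(dataset_path) as f:
            cases = [json.loads(line) for line in f]
        scores = [scorer(case["expected"], generate(case["input"])) for case in cases]
        return sum(scores) / len(scores)

    # pass_rate = run_eval("golden.jsonl", call_feature)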

Step 1: Build the golden dataset

Starting from zero:

  • Collect 20 real inputs from your feature's actual usage (with user consent / appropriate redaction).
  • For each input, ask a human (you or a domain expert) to produce the expected output. This is the golden output.
  • For each example, also record a rationale: why is this the expected output?
  • Mix deliberately: 10 typical cases, 5 edge cases, 5 adversarial / red-team inputs.

20 is the floor, not the ceiling. Aim for 100 within a month of launch. The dataset grows with the feature.
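
One record in such a dataset might look like the following JSON. The field names are illustrative, not a required schema, and the content borrows the support-ticket example used later in this guide.

    {
      "id": "ticket-0042",
      "category": "edge_case",
      "input": "Customer reports intermittent 500 errors since last night's deploy; logs attached.",
      "expected": "The customer sees intermittent 500 errors that began after last night's deploy. Logs are attached. They want confirmation of whether the deploy is the cause.",
      "rationale": "The summary must name the symptom, the onset, and the customer's actual question."
    }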

Step 2: Pick a scorer

Scorers, from most deterministic to least:

  • Exact match. The output must equal the golden exactly. Works for classification, multiple choice, structured data extraction with a known schema.
  • Regex / substring match. The output must contain certain substrings, or match a regex. Works for “did it include the required disclaimer?” checks.
  • Structured equality. For JSON outputs, compare field by field with type coercion. Tolerate reordering.
  • Semantic similarity. Embed both outputs, compare cosine similarity. Works for “is this close enough?” but noisy.
  • LLM-as-judge. Ask a separate Claude invocation to rate the output against the golden and a rubric. Flexible, but expensive and needs calibration.
  • Human review. The scorer of last resort and the calibration source for all the others.

Most real features need 2-3 scorers running in parallel. Structural checks (“is it valid JSON?”) plus a semantic check plus an LLM-as-judge rubric score is a common blend.
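
Sketches of the two deterministic scorers beyond exact match, assuming string outputs; the disclaimer regex and the coercion rule are placeholders to adapt to your feature.

    import json
    import re

    def has_required_disclaimer(actual: str) -> float:
        # Regex / substring check: did the output include the required disclaimer?
        return 1.0 if re.search(r"not (legal|financial) advice", actual, re.IGNORECASE) else 0.0

    def structured_equality(expected_json: str, actual_json: str) -> float:
        # Field-by-field comparison for JSON outputs: tolerant of key order,
        # with naive type coercion (everything compared as a string).
        try:
            expected, actual = json.loads(expected_json), json.loads(actual_json)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(actual, dict) or expected.keys() != actual.keys():
            return 0.0
        matched = sum(str(expected[k]) == str(actual[k]) for k in expected)
        return matched / len(expected) if expected else 1.0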

Step 3: LLM-as-judge, done right

LLM-as-judge can work, but it fails quietly. To use it well:

  • Use a different model family, or at least a different configuration, from the model producing the output. Calibrate against humans on a sample before trusting it.
  • Write an explicit rubric. “Score 1-5 where 5 = factually correct, cites sources, no hallucinations. 1 = factually wrong or refuses the task.”
  • Have the judge explain its score. The explanation is more useful than the score.
  • Calibrate by sampling. Every week, take 20 judge scores and have a human re-score the same outputs. If the judge-human correlation drops below 0.8, recalibrate (see the sketch after this list).
  • Avoid self-judging. Do not ask the same model to both produce the output and judge it. If you must, use a different prompt structure and expect bias.
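
A sketch of the weekly calibration check, assuming you have the judge's scores and a human's re-scores for the same 20 outputs (Pearson correlation via the standard library, Python 3.10+):

    from statistics import correlation  # Python 3.10+

    def needs_recalibration(judge_scores: list[float],
                            human_scores: list[float],
                            threshold: float = 0.8) -> bool:
        # Scores are paired: judge_scores[i] and human_scores[i] rate the same output.
        # If agreement falls below the threshold, fix the rubric or judge prompt
        # before trusting further judge scores.
        return correlation(judge_scores, human_scores) < threshold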

Step 4: Run on every prompt change

Hook the eval runner into CI:

  • On every PR that touches a prompt, a system message, or the model choice, run the full eval.
  • Require the pass rate to be >= the baseline, or the PR is blocked (a minimal gate script is sketched after this list).
  • Track pass rate over time in a dashboard. When it trends down, investigate.
  • On production regressions (user complaints, bug reports), add the failing input to the golden dataset so this specific failure is caught in the future.
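
A minimal gate script, as a sketch: it assumes the eval runner writes its pass rate to eval/results.json and that the accepted baseline is checked into the repo at eval/baseline.json (both paths are illustrative).

    import json
    import sys

    def main() -> int:
        # A non-zero exit blocks the PR in CI when the pass rate falls below baseline.
        baseline = json.load(open("eval/baseline.json"))["pass_rate"]
        current = json.load(open("eval/results.json"))["pass_rate"]
        if current < baseline:
            print(f"Eval regression: pass rate {current:.1%} is below baseline {baseline:.1%}")
            return 1
        print(f"Eval passed: pass rate {current:.1%} (baseline {baseline:.1%})")
        return 0

    if __name__ == "__main__":
        sys.exit(main())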

Step 5: Regression detection

Evals catch regressions within your development loop. To catch regressions in production:

  • Sample and score. Score every Nth production request with your eval scorers; alert if scores trend down (see the sketch after this list).
  • User feedback loop. Wire thumbs up / thumbs down into the UI. Monitor thumbs-down rate; investigate when it spikes.
  • Drift detection. If the distribution of inputs changes (new customer, new use case), your golden dataset may no longer be representative. Re-sample quarterly.
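
A sketch of the sample-and-score loop, assuming a reference-free scorer (a structural check or a judge rubric, since production traffic has no golden output) and an alert helper you already have; the sample rate, window, and threshold are placeholders.

    import random

    SAMPLE_RATE = 0.01        # score roughly 1 in 100 production requests
    WINDOW = 200              # rolling window of recent scores
    THRESHOLD = 0.9           # placeholder; set this from your own baseline
    recent_scores: list[float] = []

    def maybe_score(model_output: str, scorer, alert) -> None:
        # Score a random sample of production outputs with a reference-free eval
        # scorer and alert when the rolling average drifts below the threshold.
        if random.random() > SAMPLE_RATE:
            return
        recent_scores.append(scorer(model_output))
        window = recent_scores[-WINDOW:]
        if len(window) == WINDOW and sum(window) / len(window) < THRESHOLD:
            alert("Production eval scores trending down")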

A minimal example

For a feature that summarizes support tickets:

  • Dataset: 40 tickets, each with a 3-sentence gold summary written by a support engineer.
  • Scorers:
    • Structural: output is 1-3 sentences, under 400 characters.
    • Factual: LLM-as-judge rubric scoring factuality and completeness, 1-5 scale, gold as reference.
    • Tone: LLM-as-judge rubric scoring tone consistency against a style guide.
  • Baseline: factual average 4.1, tone average 4.4, structural pass 97%.
  • Regression gate: any scorer average more than 0.3 below its baseline blocks the PR (sketched below).

A sample eval harness in this shape is at samples/eval-harness-example, with a structured dataset in JSON.
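
The regression gate for this example, as a sketch: the 0.3 tolerance applies to the 1-5 rubric averages, and the structural pass rate gets its own floor (a number assumed here, not one from the example above).

    # Baseline averages from the example above; tolerance applies to the 1-5 rubric scores.
    BASELINE = {"factual": 4.1, "tone": 4.4}
    TOLERANCE = 0.3
    STRUCTURAL_FLOOR = 0.95   # assumed floor for the structural pass rate

    def gate_passes(rubric_averages: dict[str, float], structural_pass_rate: float) -> bool:
        # Any rubric average more than 0.3 below its baseline, or a structural
        # pass rate below the floor, blocks the PR.
        rubric_ok = all(rubric_averages[k] >= BASELINE[k] - TOLERANCE for k in BASELINE)
        return rubric_ok and structural_pass_rate >= STRUCTURAL_FLOOR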

Common mistakes

  • Too few examples. 5 examples tell you nothing about variance.
  • No adversarial cases. Your eval looks good and production looks terrible.
  • Scorer that never fails. If your eval reports a 100% pass rate on every run, your scorer is not discriminating.
  • Judge model same family as production model. Same-family judges systematically prefer outputs that look like themselves.
  • No regression gate. Evals that exist but do not block PRs become vestigial.
  • Never growing the dataset. If the dataset is frozen at launch, it stops catching things a month later.

Related