What an eval is
An eval is an automated test that measures how well your LLM-backed feature performs on a set of inputs. It answers two questions:
- Does the feature still work after the last change?
- By how much did it improve (or regress)?
An eval is not a benchmark. Benchmarks measure model capability on generic tasks. Evals measure your feature against your users' real needs.
The three building blocks
- Golden dataset: a curated, version-controlled set of input-output pairs.
- Scorer: a function that, given an expected output and an actual output, returns a score.
- Runner: the script that ties them together and reports a pass rate.
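In code, the three pieces stay small. A minimal sketch in Python (the names GoldenExample, Scorer, and run_eval are illustrative, not a prescribed API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenExample:
    input: str       # what the feature receives
    expected: str    # the human-written golden output
    rationale: str   # why this is the expected output

# A scorer maps (expected, actual) to a score in [0, 1].
Scorer = Callable[[str, str], float]

def run_eval(dataset: list[GoldenExample], scorer: Scorer,
             generate: Callable[[str], str]) -> float:
    """Run the feature over the dataset and return the mean score."""
    scores = [scorer(ex.expected, generate(ex.input)) for ex in dataset]
    return sum(scores) / len(scores)
```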
Step 1: Build the golden dataset
Starting from zero:
- Collect 20 real inputs from your feature's actual usage (with user consent / appropriate redaction).
- For each input, ask a human (you or a domain expert) to produce the expected output. This is the golden output.
- For each example, also record a rationale: why is this the expected output?
- Mix deliberately: 10 typical cases, 5 edge cases, 5 adversarial / red-team inputs.
20 is the floor, not the ceiling. Aim for 100 within a month of launch. The dataset grows with the feature.
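One entry might look like this, shown as a Python dict (the on-disk format could equally be JSON or JSONL; the field names and content are invented for illustration):

```python
# One golden-dataset entry. "category" records the deliberate mix of
# typical / edge_case / adversarial inputs described above.
example = {
    "id": "ticket-refund-007",
    "category": "edge_case",
    "input": "Customer demands a refund for an order placed 14 months ago...",
    "expected": "Customer requests a refund outside the 12-month window. "
                "Policy does not allow it; agent should offer a goodwill credit.",
    "rationale": "Tests handling of requests that fall outside policy limits.",
}
```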
Step 2: Pick a scorer
Scorers, from most to least rigorous:
- Exact match. The output must equal the golden exactly. Works for classification, multiple choice, structured data extraction with a known schema.
- Regex / substring match. The output must contain certain substrings, or match a regex. Works for “did it include the required disclaimer?” checks.
- Structured equality. For JSON outputs, compare field by field with type coercion. Tolerate reordering.
- Semantic similarity. Embed both outputs, compare cosine similarity. Works for “is this close enough?” but noisy.
- LLM-as-judge. Ask a separate Claude invocation to rate the output against the golden and a rubric. Flexible, but expensive and needs calibration.
- Human review. The scorer of last resort and the calibration source for all the others.
Most real features need 2-3 scorers running in parallel. Structural checks (“is it valid JSON?”) plus a semantic check plus an LLM-as-judge rubric score is a common blend.
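Minimal sketches of the first three scorer types, assuming each scorer returns a score between 0 and 1; the disclaimer regex is just an example pattern:

```python
import json
import re

def exact_match(expected: str, actual: str) -> float:
    return float(expected.strip() == actual.strip())

def has_disclaimer(expected: str, actual: str) -> float:
    # Substring / regex check: did the output include the required disclaimer?
    return float(re.search(r"not (financial|legal) advice", actual, re.IGNORECASE) is not None)

def json_field_equality(expected: str, actual: str) -> float:
    # Structured equality: field-by-field comparison, tolerant of key order,
    # with crude type coercion via str().
    try:
        exp, act = json.loads(expected), json.loads(actual)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(exp, dict) or not isinstance(act, dict):
        return float(exp == act)
    if not exp:
        return 1.0
    matches = sum(str(exp[k]) == str(act.get(k)) for k in exp)
    return matches / len(exp)
```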
Step 3: LLM-as-judge, done right
LLM-as-judge can work, but it fails quietly. To use it well:
- Use a model from a different family, or at least a different configuration, than the one producing the output. Calibrate it against human scores on a sample before trusting it.
- Write an explicit rubric. “Score 1-5 where 5 = factually correct, cites sources, no hallucinations. 1 = factually wrong or refuses the task.”
- Have the judge explain its score. The explanation is more useful than the score.
- Calibrate by sampling. Every week, take 20 judge scores and have a human re-score. If the correlation drops below 0.8, recalibrate.
- Avoid self-grading. Do not ask the same model to both produce the output and judge it. If you must, use a different prompt structure and expect bias toward its own outputs.
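A sketch of a judge call. It assumes you supply a call_judge function that sends a prompt to a judge model from a different family than the model under test and returns its text response; the rubric and the JSON response format are illustrative:

```python
import json
from typing import Callable

JUDGE_RUBRIC = """You are grading a model output against a golden reference.
Score 1-5 where 5 = factually correct, covers the same points as the reference,
no hallucinations; 1 = factually wrong or refuses the task.
Respond with JSON: {"score": <1-5>, "explanation": "<one or two sentences>"}"""

def llm_judge(expected: str, actual: str, call_judge: Callable[[str], str]) -> dict:
    """Ask a separate judge model for a score and keep the explanation, not just the number."""
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Golden reference:\n{expected}\n\n"
        f"Model output:\n{actual}"
    )
    raw = call_judge(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judges fail quietly; surface unparseable responses instead of guessing a score.
        return {"score": None, "explanation": f"unparseable judge response: {raw[:100]}"}
```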
Step 4: Run on every prompt change
Hook the eval runner into CI:
- On every PR that touches a prompt, a system message, or the model choice, run the full eval.
- Require the pass rate to be at or above the baseline; otherwise the PR is blocked (see the gate sketch after this list).
- Track pass rate over time in a dashboard. When it trends down, investigate.
- On production regressions (user complaints, bug reports), add the failing input to the golden dataset so this specific failure is caught in the future.
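A minimal gate script for CI, assuming the eval runner writes its pass rate to a JSON file and the baseline is checked into the repo; the file paths and field names are placeholders:

```python
#!/usr/bin/env python3
# Compare the latest eval run against the stored baseline and exit nonzero
# so the PR check fails on regression.
import json
import sys

def main() -> None:
    with open("eval/baseline.json") as f:       # e.g. {"pass_rate": 0.92}
        baseline = json.load(f)
    with open("eval/latest_run.json") as f:     # written by the eval runner
        current = json.load(f)
    if current["pass_rate"] < baseline["pass_rate"]:
        print(f"Regression: pass rate {current['pass_rate']:.1%} "
              f"is below baseline {baseline['pass_rate']:.1%}")
        sys.exit(1)
    print(f"OK: pass rate {current['pass_rate']:.1%} meets baseline {baseline['pass_rate']:.1%}")

if __name__ == "__main__":
    main()
```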
Step 5: Regression detection
Evals catch regressions within your development loop. To catch regressions in production:
- Sample and score. Score every Nth production request with your eval scorers; alert if the scores trend down (see the sketch after this list).
- User feedback loop. Wire thumbs up / thumbs down into the UI. Monitor thumbs-down rate; investigate when it spikes.
- Drift detection. If the distribution of inputs changes (new customer, new use case), your golden dataset may no longer be representative. Re-sample quarterly.
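A sketch of the sample-and-score hook, with invented sampling rate and alert threshold. Production requests have no golden output, so the scorer here must be reference-free (structural validity, rubric-only judge):

```python
import random
from typing import Callable

SAMPLE_RATE = 0.02       # score roughly 1 in 50 production requests
ALERT_BELOW = 0.75       # alert when the rolling mean drops below this
WINDOW = 200             # number of recent scores to average over

recent_scores: list[float] = []

def maybe_score(request: str, response: str,
                reference_free_scorer: Callable[[str, str], float],
                alert: Callable[[str], None]) -> None:
    """Hook on the production request path: sample, score, alert on a downward trend."""
    if random.random() > SAMPLE_RATE:
        return
    recent_scores.append(reference_free_scorer(request, response))
    window = recent_scores[-WINDOW:]
    mean = sum(window) / len(window)
    if len(window) >= 50 and mean < ALERT_BELOW:
        alert(f"Eval score rolling mean dropped to {mean:.2f}")
```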
A minimal example
For a feature that summarizes support tickets:
- Dataset: 40 tickets, each with a 3-sentence gold summary written by a support engineer.
- Scorers:
- Structural: output is 1-3 sentences, under 400 characters.
- Factual: LLM-as-judge rubric scoring factuality and completeness, 1-5 scale, gold as reference.
- Tone: LLM-as-judge rubric scoring tone consistency against a style guide.
- Baseline: factual average 4.1, tone average 4.4, structural pass 97%.
- Regression gate: any score more than 0.3 below baseline blocks the PR.
A sample eval harness in this shape is at samples/eval-harness-example, with a structured dataset in JSON.
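A sketch of that regression gate in code, using the baseline numbers above. The uniform 0.3 drop is applied to all three scorers for simplicity; the structural pass rate is on a 0-1 scale, so in practice it may deserve its own threshold:

```python
# Regression gate for the ticket-summary example: any average more than
# 0.3 below its baseline blocks the PR.
BASELINE = {"factual": 4.1, "tone": 4.4, "structural": 0.97}
MAX_DROP = 0.3

def regressions(current: dict[str, float]) -> list[str]:
    """Return the scorers whose current average fell more than MAX_DROP below baseline."""
    return [name for name, base in BASELINE.items()
            if current.get(name, 0.0) < base - MAX_DROP]

failed = regressions({"factual": 3.7, "tone": 4.5, "structural": 0.96})
if failed:
    raise SystemExit(f"Blocked: regression in {', '.join(failed)}")
```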
Common mistakes
- Too few examples. 5 examples tell you nothing about variance.
- No adversarial cases. Your eval looks good and production looks terrible.
- Scorer that never fails. If your eval passes 100% on every run, your scorer is not discriminating.
- Judge model same family as production model. Same-family judges systematically prefer outputs that look like themselves.
- No regression gate. Evals that exist but do not block PRs become vestigial.
- Never growing the dataset. If the dataset is frozen at launch, it stops catching things a month later.