Why we stopped writing one-shot prompts

1 The liability of the one-shot

The liability of the one-shot.

A one-shot prompt is a prompt that exists once, in the place where it is called, with no version history, no evaluation, and no documentation.

In 2024 we wrote a lot of them. They worked. They shipped. The output was good enough to demo and, usually, good enough for the first few customers. Then something happened: a model upgrade, a new feature that subtly changed the shape of the input, a report from a customer that an output "feels different lately". None of these events by themselves was catastrophic. The trouble was the response. When something changed, we could not tell whether the change was real or remembered. We had no ground truth.

We have a rule now. We do not write a production prompt without also writing the thing that will catch us when the prompt breaks. That thing is a short eval set and a version history. It is not a lot of extra work. It is a different posture.

2 What a prompt is

What a prompt is, once you stop treating it as throwaway.

A prompt is a piece of code that runs on a probabilistic interpreter and is exposed to arbitrary user input. That framing changes everything downstream.

Nothing else in a codebase gets written, deployed, and never looked at again. Nothing else gets the breezy treatment of "let me just try a few words and see what sticks". Production code has version history, tests, reviewers, and a rollback plan. A production prompt needs all four.

Version history means every edit lives in git with a note on what changed and why. The prompt file is human-readable prose and the file lives in the same repo as the code that invokes it. Tests means an evaluation set with golden inputs and scorers, not a vibe check. Reviewers means a person other than the author reads the diff before merge. Rollback plan means the previous version is retrievable with one command.

None of this is exotic. The move from one-shot to library is a move from treating prompts as configuration to treating them as code. Once you make that move, the rest follows.

3 The minimum viable library

The minimum viable library.

You do not need a prompt management platform. You need four things in the repo.

Prompt files. Each prompt lives in its own file: prompts/doc_summary.v3.xml or similar. XML tags, structured sections, comments in the file explaining non-obvious decisions. When the prompt is edited, the file is renamed or versioned and the change goes through code review.

Golden dataset. Per prompt, 20 to 50 test cases covering the shape of real production input. A mix of easy, hard, and adversarial. Stored as JSONL. When someone reports a bad output, the input gets added to the golden set.

Scorers. Per prompt, a short list of things the output must satisfy. Some of these are regex or structured checks (JSON parses, length within bounds). Others are LLM-as-judge prompts. Each scorer has a weight; weighted sum above a threshold is a pass.

A runner. A script that takes a prompt version, runs it against every case in the golden set, computes scores, and emits a pass rate. The runner should be fast enough to run in CI but cheap enough to run on a developer's laptop. Most of ours run in under 90 seconds against 40 cases.

That is it. No platform, no SaaS, no new vendor. Four pieces of code that live in the same repo as the product. Under 400 lines of glue for most projects. The discipline is the whole value; the tooling is secondary.

4 The rituals that keep it alive

The rituals that keep it alive.

A prompt library dies the first week nobody looks at the eval results. Three rituals prevent that.

Every prompt change runs the eval. No exceptions. The eval runs on the PR and the PR will not merge if the pass rate drops without explanation. "Without explanation" is important - sometimes a real improvement causes a drop on cases that were testing the old behavior, and you update the golden set in the same PR.

Every model upgrade reruns the full library. When Anthropic ships a new Claude version, we run the full prompt library against the new model before promoting it to production for the relevant client. This is how we caught a subtle shift in the Sonnet output format last cycle that would have silently broken a structured-output pipeline. The run takes 20 minutes. The breakage would have been weeks of firefighting.

Every customer-reported regression adds a golden case. A user says "this used to work better". We find the specific input, add it to the golden set with the expected behavior, and verify the current prompt version passes. If it does not, the prompt gets a version bump. This is how the library accumulates value over time.

These three rituals are cheap. None of them takes more than 30 minutes per week for a small team. What they buy is the ability to answer the question "is this really different, or does it just feel different?" with data instead of speculation.

5 Write the prompt twice

Closing. Write the prompt twice.

The discipline can be captured in five words: write the prompt twice.

The first write is the one you already do. You are in a notebook or a REPL. You are trying phrasings. You are watching the output. You are iterating by feel. This is the experiment. Let it run until you have a prompt that seems to work.

The second write is the one that ships. You take what you learned from the experiment, structure it into the file, add the golden cases you tested against, write the scorers that encode what "working" means, and run the harness. If the pass rate is above your gate, promote it to production. If it is not, go back to the experiment.

The second write takes maybe ninety minutes on top of the hour or two you already spent on the first. For the first few prompts, this feels like overhead. By the fifth prompt, it feels like the only responsible way to work. By the twentieth, it is load-bearing: your ability to upgrade models, respond to customer reports, and explain the system to a new engineer all depend on the library being there.

The one-shot prompt is still fine for experiments. It should never leave your machine. Anything that serves a customer should be in the library, versioned, evaluated, documented. That is the only way to run AI features for more than one quarter without gradually losing the plot.

E6.X Related work

Sample

Prompt library example

Five prompts, three versions each, eval scores per version.

Sample

Eval harness example

The runner and dashboard that makes the library trustworthy.

Guide

Prompt engineering for operators

The five-part structure that every prompt in the library uses.

E6.S Subscribe