Why this matters: the cost of a brittle template.
Every time Anthropic ships a new model, our engagements run against it within days. The question that decides whether the firm lives or dies that week is: did anything we built yesterday break on the new model?
If the answer is "we have to rewrite 30 templates before the next engagement ships," we are in a very bad place. We will miss deadlines, client handbooks will ship with regressions we did not catch, and we will spend the week firefighting instead of delivering. That is the failure mode we design against.
The good news: model upgrades from Anthropic have been remarkably backward-compatible in the ways that matter for prompt engineering. The bad news: "mostly compatible" is not the same as "all compatible," and the templates that do break are predictable in a way nobody tells you about when you first write them.
This essay is what we learned carrying a 140-template library across six model releases. It is the specific structural patterns that survive, the ones that break, and the rewrite playbook we use before the next release lands.
The six model upgrades we carried the library across.
For readers who have not been tracking the releases: here is what we mean by "six upgrades" and what our library looked like at each point.
- Sonnet 3.5 baseline (mid-2024): 40 templates at library start. Most are rough. No eval harness yet.
- Sonnet 4 (early 2025): 68 templates. 4 broke on upgrade, all classification tasks whose outputs turned over-verbose on the new model.
- Opus 4 (mid-2025): 82 templates. 11 broke. Extended thinking launched here and several templates needed budget tuning.
- Opus 4.5 (late 2025): 98 templates. 3 broke. Mostly cosmetic output-format drift.
- Opus 4.6 / Fast mode (early 2026): 117 templates. 7 broke. Latency-sensitive templates suddenly had new cost tradeoffs to consider.
- Opus 4.7 (current): 140 templates. 2 broke on first evaluation, both tool-use templates that had been counting on a specific tool-calling format that changed subtly.
Count the total: 27 broken templates across 545 template-evaluation events, the library size summed across all six releases. That is a 5 percent break rate, and the rate has been dropping as we learned which patterns were brittle.
We care about this number not because it is good or bad in absolute terms, but because it tells us how much buffer we need on release day. Budget one operator-day to evaluate the library on every major model release. Expect to rewrite between 2 and 12 templates. Ship the rewrite within 72 hours.
What a durable template looks like structurally.
The templates that have never broken share a specific structural shape. Not a shape about language or phrasing. A shape about what the template tells the model to do.
Explicit role and scope.
Every durable template opens with a role statement that constrains what kind of answer is allowed. "You are a security engineer writing a finding in the style of NIST SP 800-115." Not "you are an expert at security findings." The former gives the model a defined shape it can refuse to drift from. The latter invites style drift on every minor model update.
Output contract, not output example.
Durable templates describe the output in terms of constraints: required sections, field types, format expectations. They do not just show one example and hope the model generalizes. Brittle templates often have a single "here is what we want" example that turns out to match one model's default voice and not another's. When the model changes, the example no longer anchors the output shape, and you get drift.
Structured input via XML or JSON.
Every durable template we have uses tagged input blocks, either XML-style or JSON payloads. Templates that glued inputs together with free-form sentences ("The company is X and the environment is Y...") tend to drift when the new model tokenizes the combined string slightly differently. Structure is self-documenting and survives tokenization changes.
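A minimal sketch of the difference; the field names here are hypothetical stand-ins for a template's inputs:

```python
def build_input_block(company: str, environment: str, scope: str) -> str:
    """Assemble inputs as XML-style tagged blocks, not glued prose."""
    return (
        f"<company>{company}</company>\n"
        f"<environment>{environment}</environment>\n"
        f"<scope>{scope}</scope>"
    )

# The brittle alternative this replaces:
#   f"The company is {company} and the environment is {environment}..."
```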
Refusal pathways built in.
Durable templates explicitly tell the model what to do when it does not have enough information. "If the input does not include a CVSS vector, return `CVSS: insufficient_info` rather than inferring." This saves you from a whole class of "the new model hallucinated a plausible value" regressions.
No hidden model-specific quirks.
If the template depends on a quirk (the model's tendency to start with "Certainly," or a specific tool-calling format), it is brittle. Every quirk we have depended on has broken in at least one upgrade. The rule we now follow: if you cannot explain why the template works to someone who has never used this model family, rewrite it.
The two failure modes that break a template on upgrade.
Every broken template across six upgrades has fallen into one of two categories. We know the failure modes well enough now to diagnose them in minutes.
Failure mode A: verbosity drift.
Newer models are often tuned to be more thorough by default. A template that produced a 4-line classification response on the old model suddenly produces an 11-line response with explanation, caveats, and a friendly closer. If the output is consumed by downstream code that expects the short shape, the pipeline breaks.
Diagnostic: run the old template against the new model and compare output length against the eval set's expected lengths. If length is up more than 30 percent, you are probably in verbosity drift. Fix: tighten the output contract. "Respond with exactly one line in the format `category: <label>`. No explanation, no preamble, no closing remarks."
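A sketch of that length check, assuming the eval set stores expected outputs as plain strings; `EvalCase` and the function names are our shorthand, not a library API:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input_text: str
    expected_output: str

def length_drift(cases: list[EvalCase], outputs: list[str]) -> float:
    """Relative growth of mean output length vs. the eval set's expected lengths."""
    expected = sum(len(c.expected_output) for c in cases) / len(cases)
    observed = sum(len(o) for o in outputs) / len(outputs)
    return (observed - expected) / expected

def in_verbosity_drift(cases: list[EvalCase], outputs: list[str]) -> bool:
    # The 30 percent threshold from the diagnostic above.
    return length_drift(cases, outputs) > 0.30
```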
Failure mode B: tool-use or format protocol change.
This is the one that breaks agentic templates hardest. The tool-calling protocol, system prompt format, or extended-thinking surface can change subtly between versions. If your template embeds assumptions about which keys are in the tool-call block, or whether thinking tokens are visible, those assumptions can quietly shift.
Diagnostic: run your tool-use templates through the new model's tool-calling surface and check the shape of what comes back. Fix: version-guard tool-use glue code. Every new model release should bump a version in your tool-orchestration layer, with a test that says "this version of the glue code is certified against this model release." If the model version changes and the glue version has not been certified, fail loud.
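A sketch of the fail-loud guard; the certification table and model strings are illustrative:

```python
GLUE_VERSION = "2.3"

# Each pair: (model release, glue version certified against it).
# A pair is added only after the eval harness passes on that release.
CERTIFIED = {
    ("claude-opus-4-5", "2.2"),
    ("claude-opus-4-6", "2.3"),
}

def assert_certified(model: str) -> None:
    """Refuse to run tool-use glue against an uncertified model release."""
    if (model, GLUE_VERSION) not in CERTIFIED:
        raise RuntimeError(
            f"Glue v{GLUE_VERSION} is not certified against {model}. "
            "Run the eval harness and add the pair to CERTIFIED."
        )
```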
The rewrite rules that give you upgrade durability.
We have eight rules. They are boring. They are also the reason our break rate has fallen from 14 percent on the Sonnet 4 upgrade to under 2 percent on Opus 4.7.
Rule 1: one output contract, explicit.
Describe the output in explicit terms. Required fields, types, format. Not "something like this example." If JSON is expected, say it is JSON and specify the schema. If a single line, say exactly one line.
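What an explicit contract looks like inside a template, with a hypothetical schema:

```python
# Stated as constraints, not as a single example to imitate.
OUTPUT_CONTRACT = """\
Respond with a single JSON object and nothing else.
Required fields:
  - "category": one of "low", "medium", "high"
  - "rationale": one sentence, maximum 25 words
  - "confidence": float between 0.0 and 1.0
Do not add fields. Do not wrap the JSON in markdown fences."""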
Rule 2: XML-tagged inputs.
Inputs go in tagged blocks: `<context>`, `<task>`, `<constraints>`, `<input>`. Tags are self-documenting and survive tokenization changes better than prose.
Rule 3: refusal and uncertainty are named responses.
Templates should explicitly tell the model what to output when it lacks sufficient input, when the input is ambiguous, or when the task falls outside scope. Name the response shape: `{"status": "insufficient_info", "missing_fields": [...]}`.
Rule 4: temperature tied to task.
Classification and extraction: temperature 0. Analysis and reasoning: 0.2 to 0.4. Editorial drafting: 0.6 to 0.8. Templates that left temperature on a default "whatever feels creative" setting were the most likely to drift across models.
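A sketch of pinning temperature per task type, matching the bands above; the task labels are our own, not an API enum:

```python
TEMPERATURE_BY_TASK = {
    "classification": 0.0,
    "extraction": 0.0,
    "analysis": 0.3,    # within the 0.2 to 0.4 band
    "editorial": 0.7,   # within the 0.6 to 0.8 band
}

def temperature_for(task_type: str) -> float:
    # KeyError on an unknown task type: fail loud rather than
    # falling back to a model default that drifts across releases.
    return TEMPERATURE_BY_TASK[task_type]
```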
Rule 5: no dependency on default voice or default length.
If the template relies on the model's default behavior to fill in gaps, those gaps will fill differently on the next model. State what you want. Do not rely on defaults.
Rule 6: version-tag every template.
Each template has a version number and a changelog. When a model upgrade requires a rewrite, the new template gets a bump. The eval results table has one column per model release, one row per template, so we can see at a glance which templates have been certified against which models.
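A sketch of the metadata attached to each template; the structure is illustrative, not our exact schema:

```python
TEMPLATE_META = {
    "name": "security-finding-nist-800-115",
    "version": "3.1",
    "changelog": [
        ("3.1", "Tightened output contract after verbosity drift"),
        ("3.0", "Moved inputs to XML-tagged blocks"),
    ],
    # One entry per model release: the eval-table column for this row.
    "certified": {
        "claude-opus-4-5": "green",
        "claude-opus-4-6": "green",
        "claude-opus-4-7": "red",   # not yet evaluated
    },
}
```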
Rule 7: decouple prompt from glue.
The prompt lives in one file. The code that calls the API, parses the response, handles errors, and routes to downstream code lives in another file. When a model upgrade changes response shape, only the glue changes. The prompt stays stable.
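A sketch of the split; the path and the `call_model` callable are placeholders:

```python
from pathlib import Path
import json

PROMPT_PATH = Path("prompts/classify_finding_v3.txt")  # prompt file: stable

def parse_response(raw: str) -> dict:
    # Response-shape assumptions live in the glue, not in the prompt.
    # This is the only function that changes when a release shifts shape.
    return json.loads(raw)

def run(input_block: str, call_model) -> dict:
    prompt = PROMPT_PATH.read_text()
    raw = call_model(prompt + "\n\n" + input_block)
    return parse_response(raw)
```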
Rule 8: write the eval before the prompt.
This is the rule most teams skip and the one that pays the biggest dividends. Before you write the prompt, write the eval: 30 to 80 example inputs and their expected outputs. The prompt is done when it passes the eval. The prompt survives upgrade when the eval passes on the new model. Without the eval, you cannot know.
The sketch below shows this eval-first shape in practice for a document-summarization template.
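A minimal sketch, with placeholder case content and a stand-in scorer; the real suite has 30 to 80 cases per template, and the 0.85 threshold is illustrative:

```python
# Written before the prompt exists. The prompt is "done" when it passes.
EVAL_CASES = [
    {"input": "...full incident report...", "expected": "...reference summary..."},
    # 30 to 80 of these per template.
]

def passes(outputs: list[str], scorer, threshold: float = 0.85) -> bool:
    """The prompt clears the bar when mean score meets the threshold."""
    scores = [
        scorer(out, case["expected"])
        for out, case in zip(outputs, EVAL_CASES)
    ]
    return sum(scores) / len(scores) >= threshold
```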
The eval harness that catches breakage within a day.
Every template has an eval suite. Every eval runs on every model release. The running-time budget is one operator-day.
Our harness is boring. 140 templates. Each has 30 to 80 evaluated examples. Scorers are per-template: exact match for classification, JSON schema validation for structured output, LLM-as-judge with rubric for editorial output, ROUGE for summarization baselines. We keep a "last certified model" column per template. Green means this template has been tested on this model release and passed. Amber means it has been tested and regressed. Red means it has not been tested yet.
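A sketch of the per-template scorer dispatch; the structured-output scorer here is a stand-in for full schema validation, and the judge and ROUGE scorers are omitted:

```python
import json

def exact_match(output: str, expected: str) -> float:
    # Classification templates: the output must match exactly.
    return 1.0 if output.strip() == expected.strip() else 0.0

def json_parses(output: str, expected: str) -> float:
    # Stand-in: the real scorer validates against the template's schema.
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

SCORERS = {
    "classification": exact_match,
    "structured_output": json_parses,
    # "editorial": LLM-as-judge with rubric; "summarization": ROUGE.
}

def score(template_kind: str, output: str, expected: str) -> float:
    return SCORERS[template_kind](output, expected)
```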
Release morning: we run the full library against the new model. Every template starts red for the new release; after the run, each one is green or amber. Amber templates get triaged into the two failure-mode categories above. We then allocate the rewrites: verbosity-drift templates get output-contract tightening, format-protocol templates get glue-code updates.
The harness we ship for clients as part of engagements is smaller (10 to 30 templates per client, depending on service line), but follows the same shape. We teach clients to run their own harness at every model release. It becomes part of their standard operations, not something they have to remember to do.
What we do on release morning.
This is the sequence we follow whenever a new model ships. It has not deviated in six releases.
Hour 0 to 1: scan the release notes.
Read the model card, release notes, and any blog post. Flag anything about extended thinking, tool use, context window, or output format. These are the four surfaces where change matters.
Hour 1 to 2: flip the eval harness to the new model.
Single config change. No template changes. We are running yesterday's prompts against today's model, which is exactly the comparison we want.
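In config terms, the whole change is one line; key names and model strings are illustrative:

```python
HARNESS_CONFIG = {
    "model": "claude-opus-4-7",   # yesterday: "claude-opus-4-6"
    "templates_dir": "prompts/",
    "evals_dir": "evals/",
}
```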
Hour 2 to 5: run the full library.
Parallelized. 140 templates. All evals. Cost is typically $30 to $80 in API spend for the full run, depending on eval dataset size.
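A sketch of the parallel run; `run_template_evals` is a placeholder for "run one template's eval suite on the new model and return its status":

```python
from concurrent.futures import ThreadPoolExecutor

def run_library(template_names: list[str], run_template_evals) -> dict[str, str]:
    """Run every template's eval suite in parallel.

    Returns a map from template name to status ("green" or "amber"),
    which becomes the new column in the certification table.
    """
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(run_template_evals, template_names))
    return dict(zip(template_names, results))
```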
Hour 5 to 7: triage reds and ambers.
Categorize each failing template into verbosity drift or protocol change. Assign to operators. For protocol changes, the glue-code owner takes them. For verbosity drift, the template owner takes them.
Hour 7 to 16: rewrite and rerun.
Rewritten templates get the next version number, a changelog entry, and rerun against the new model. Also rerun against the prior model; backward compatibility matters.
Hour 16 to 20: certify and roll out.
Green-light the model version in our configuration. Flip client-facing workflows over to the new model on the next scheduled batch run (never mid-engagement).
Hour 20 to 24: write the release postmortem.
What broke. Why. Which of the rewrite rules above would have caught it at design time. If no existing rule would have caught it, add a new rule. Every upgrade teaches us something, and the teaching compounds.
The goal is never "zero breakage on upgrade." The goal is "low-enough breakage that the release is an opportunity, not a crisis." We hit that goal. You can too, and the eight rules above are the whole recipe.