E15 Essay · Applied AI

The self-critique loop.

The single prompt pattern that improved the quality of our handbook generation more than any other change we made in the first year. What it is, why it works, the two failure modes it cannot fix, and the actual prompts we use.

Applied AI · 13 min read · 2026-04-22 · by the operator, drafting assisted by Claude
Corrections log: none yet. If you find a factual error, email hello@nexcur.ai and we will log it here, dated.
1 The shape of the loop

Generate a draft. Ask for a list of the draft's weakest claims. Revise only those claims. Repeat once.

That is the whole pattern. What makes it work in practice is the constraint that the critique step produces a list, not a rewrite. The list is cheap for a human operator to scan, and it forces the model to articulate the kind of mistake before attempting to fix it. A critique that says "paragraph 3 has a weak citation" is cheap to verify. A silent rewrite of paragraph 3 is not.

In production, our pipeline runs self-critique at three stages of a handbook build: after the section outline, after the first prose draft of each section, and after the full handbook is assembled. Each stage has a critique prompt tuned to the kind of failure that tends to happen at that stage. Outline critique hunts for missing scope. Prose critique hunts for unsupported claims and repetition. Assembly critique hunts for contradiction across sections and for handoff language that would confuse the human reader.

The discipline is that the critique prompt never gets to rewrite the artifact. It only produces a list of concerns with references to the location. A second pass, a revision prompt, takes the original plus the critique and produces the revision. This separation matters. When we let critique-and-revise happen in a single pass, the model finds fewer issues and the issues it finds get softened in the rewrite. When they are separated, the critique is sharper and the revision is targeted.
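
In code, the separation is just two model calls with the draft threaded between them. Below is a minimal sketch of that orchestration, assuming a generic complete(prompt) call into whichever model API the pipeline uses; the function and template names are illustrative, not lifted from our repo.

def self_critique_pass(draft, critique_template, revise_template, complete):
    # Pass 1: critique only. The model returns a numbered list of concerns
    # with locations; it never rewrites the draft.
    concerns = complete(critique_template.format(draft=draft))
    # Pass 2: targeted revision. The model sees the original plus the list
    # and makes the minimum local change per concern.
    revised = complete(revise_template.format(draft=draft, concerns=concerns))
    return revised, concerns

def build_section(outline_item, templates, complete):
    draft = complete(templates["generate"].format(outline=outline_item))
    concern_log = []
    for _ in range(2):  # "repeat once": the critique/revise cycle runs twice in total
        draft, concerns = self_critique_pass(
            draft, templates["critique"], templates["revise"], complete)
        concern_log.append(concerns)  # kept so the operator can scan the lists
    return draft, concern_log

Keeping the concern lists next to the revised draft is what makes the loop cheap for the operator to audit.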

2 Why it works: generation and evaluation use different muscles

Asking a model to write something good and asking a model to notice what is not good about something are two different tasks, even when both are asked of the same weights.

The generation task pulls toward coherence. A draft written in one pass optimizes for reading well, which for a model means sentences that flow, smooth transitions, and a consistent tone. Coherence is a local property. A draft can be coherent and still contain a claim that is unsupported, a section that is redundant, or a conclusion that does not follow from the evidence presented. These are global properties that generation mode tends to gloss over, because hitting them requires stopping the flow.

The evaluation task, presented separately, pulls the model toward the global properties. When the prompt is "list the three weakest claims in this draft and explain why each one is weak", the model is not trying to produce readable prose. It is scanning for gaps. The output is often ugly - bullet points, hedging, mixed quality - but it is sharper than anything the generation prompt would have flagged.

This is the same reason editing is a different skill from writing. The two are compatible in one head, but even then, most writers will tell you that switching modes is a distinct act. They read the draft the next morning. They put on a different lens. We are doing the same thing with a second prompt.

3 The actual prompt we ship

Here is the prose-stage critique prompt, slightly anonymized. It lives in our template repo and we tweak it per service line.

You are the editor of a Signature Handbook section. The draft below
was written by an operator assisted by a generation model. Your job
is to identify what would make a careful senior reader trust this
section less, not to rewrite it.

Return a numbered list of three to seven concerns. Each concern
must be:
  - Located (which paragraph or sentence)
  - Named (what kind of flaw: unsupported claim, repetition,
    weak transition, unclear antecedent, generic language,
    contradiction with earlier section, missing caveat)
  - Specific (what would need to change, in one sentence)

Do not rewrite. Do not soften. If the section is strong, return
three concerns anyway, because every draft has a weakest link.

Draft:
{{section_draft}}

Prior sections (for contradiction check):
{{prior_sections_summary}}

Two things about this prompt are load-bearing. First, the cap of three to seven. Without a cap the model either finds two surface issues and stops, or produces an unhelpfully long list. Three to seven is the range where the model has to prioritize but also cannot shortcut. Second, the "if the section is strong, return three concerns anyway" clause. This is the single most important line. Without it, self-critique of a good draft produces sycophancy and misses the marginal improvement. With it, the model is forced to pick the weakest link even on strong work.

The revision prompt that follows is simpler. It takes the original draft, the numbered list of concerns, and instructs: "Address each concern by making the minimum local change. Do not rewrite paragraphs that the list does not flag. If you think a concern is wrong, respond with 'skipped' and a one-line reason rather than changing the text." The skip clause matters because self-critique is not infallible, and a revision loop that cannot push back produces over-correction.
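
Assembled from those clauses, the revision prompt reads roughly like this. Treat it as a sketch: the exact wording in the repo is tuned per service line, and the {{critique_list}} placeholder name is invented here for illustration.

You are revising a Signature Handbook section against an editor's
numbered list of concerns.

For each concern, make the minimum local change that addresses it.
Do not rewrite paragraphs that the list does not flag. If you think
a concern is wrong, respond with "skipped" and a one-line reason
rather than changing the text.

Draft:
{{section_draft}}

Concerns:
{{critique_list}}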

4 What self-critique cannot fix

There are two failure modes that show up in our data and that no amount of self-critique will remove. They are the reasons a human reviewer still signs off every deliverable.

Failure mode one: the shared blind spot. If the generation model does not know something, the critique model run on the same weights will not know it either. We see this most often with claims about specific products, specific cloud-provider APIs, and specific laws. The draft says "AWS IAM supports attribute-based conditions via tag-equals". The critique says "no concerns on the AWS claim". Both are wrong in the same direction because both are working from the same training data, and neither has retrieved the current AWS docs. This is why we now route fact-critical claims through a tool-use retrieval step before critique even runs - and why a human checks any claim about a vendor product before the handbook ships.
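
A rough sketch of that routing is below. It assumes a claim-extraction prompt and a retrieve(query) function backed by whatever search or documentation tool the pipeline exposes; none of these names come from our pipeline.

def ground_fact_claims(draft, complete, retrieve):
    # Ask the model to pull out every claim about a specific product,
    # API, or law; these are the claims the shared blind spot hits.
    claims = complete(
        "List every claim in the draft below about a specific product, "
        "API, or law, one per line.\n\nDraft:\n" + draft
    )
    evidence = []
    for claim in claims.splitlines():
        if claim.strip():
            # retrieve() stands in for a tool-use search over current docs.
            evidence.append((claim, retrieve(claim)))
    # The evidence is handed to the critique prompt alongside the draft,
    # so the critique no longer works from training data alone.
    return evidence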

Failure mode two: the voice drift. Self-critique is sharp on structure and claims. It is surprisingly blind to voice. When a draft has drifted into a more corporate register ("leveraging synergistic paradigms"), the critique step almost never flags it, because the critique model is not operating from a voice reference. We solved this by adding a third step: a voice-check pass that evaluates the draft against three exemplar paragraphs from our style guide. That step finds the drift. Self-critique alone does not.
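
A minimal version of that voice-check pass might look like the sketch below, with the three exemplar paragraphs pulled from the style guide; the prompt wording here is illustrative rather than the one we ship.

VOICE_CHECK_TEMPLATE = (
    "Compare the draft against the three exemplar paragraphs. List every "
    "sentence whose register drifts from the exemplars (more corporate, "
    "more generic, more formal) and name the drift. Do not rewrite.\n\n"
    "Exemplars:\n{exemplars}\n\nDraft:\n{draft}"
)

def voice_check(draft, exemplars, complete):
    # A third, separate pass: structure and claims are already covered by
    # self-critique, so this one only looks at register.
    return complete(VOICE_CHECK_TEMPLATE.format(exemplars=exemplars, draft=draft))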

These two failures are why self-critique is a step, not a substitute. It moves the quality floor up, but the ceiling still needs a human, a retrieval step, and a voice reference. What we have found in practice is that self-critique does about sixty percent of the work of a senior editor. That is a large fraction but not all of it, and the remaining forty percent is where the operator spends their time.

5 Cost and when to skip

Every self-critique pass is another round of tokens. For a long-form handbook section, that is non-trivial. The economics only work because prompt caching makes the draft-plus-critique round cheap.

With prompt caching on the draft (the draft is what gets re-read in the critique pass), the marginal cost of a critique is roughly the output tokens of the list plus a small read-multiplier on the draft. That is ten to twenty cents per section for a ten-thousand-token section. For a full handbook of thirty sections that is three to six dollars of critique cost, which is immaterial against the human editor time it saves.
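
The arithmetic is simple enough to keep in a helper. In the sketch below the token rates are left as parameters, since they depend on the model and the caching discount in effect; the roll-up uses the observed per-section figure.

def critique_cost(draft_tokens, list_tokens, cache_read_rate, output_rate):
    # Marginal cost of one critique pass with the draft served from cache:
    # a discounted re-read of the draft plus the output tokens of the list.
    return draft_tokens * cache_read_rate + list_tokens * output_rate

# Roll-up at the observed $0.10-0.20 per ten-thousand-token section:
print(30 * 0.10, 30 * 0.20)  # 3.0 to 6.0 dollars per thirty-section handbook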

We skip self-critique in two specific cases. First: very short outputs. If we are generating a two-paragraph executive summary, the critique step is noise. The operator can evaluate two paragraphs as fast as they can read the critique list. Second: outputs that will be reviewed end-to-end by a domain specialist anyway. For the IAM Terraform we hand off to the platform engineer, we do not run self-critique; the engineer is going to read every line regardless, and the critique pass would slow the cycle without adding signal.
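
In a pipeline this becomes a simple gate ahead of the critique stage. A sketch, with a length threshold that is illustrative rather than taken from our config:

def should_run_critique(output_tokens, specialist_will_read_every_line):
    # Skip for artifacts a domain specialist reads line-by-line anyway
    # (e.g. the IAM Terraform handed to the platform engineer).
    if specialist_will_read_every_line:
        return False
    # Skip for very short outputs the operator can evaluate faster than
    # they can read a critique list; the threshold is illustrative.
    if output_tokens < 800:
        return False
    return True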

Outside those two cases, self-critique runs by default. It is the single highest-yield technique we apply, and the one we think is most undervalued by teams still running one-shot prompts.

E15.X Related work
- Essay · Why we stopped writing one-shot prompts. The wider argument self-critique sits inside; multi-turn beats one-shot in almost every case.
- Sample · Prompt library example. Five to eight real prompts with v1/v2/v3 and eval scores per version.
- Service · Product development. How we build evals, prompt pipelines, and shipping-ready AI features.