The incident response runbook nobody had time to write

1 Why the runbook is always missing

Every engagement, we ask to see the incident response runbook. About seventy percent of the time, the answer is some variation of "we have a Google Doc somewhere, it is probably out of date".

The reason is not negligence. It is economics. Writing an IR runbook from scratch is a twenty-to-forty-hour task for a security lead, and at every stage of a startup's life there is something more urgent. The document only pays off when an incident happens, which is statistically rare, and by the time an incident actually happens the runbook is six months stale regardless.

The result is a consistent pattern: the runbook exists in some form for a SOC 2 audit, gets a yellow-highlight annual review, and then stays untouched until the next audit. The on-call engineer at 3 AM, in the middle of an actual incident, does not read it. They page the security lead directly, or they improvise from half-remembered incidents at a previous job. This is the runbook problem. It is not that the document is badly written. It is that it is written for the wrong reader.

2 The three sections that actually matter

A runbook the on-call reads at 3 AM has three sections that carry the weight. Everything else is scaffolding.

Section one: the decision tree. Not a narrative about severity classification, but a literal yes-or-no tree the on-call can walk. "Is customer data possibly exposed? Yes -> go to branch A. Is production traffic affected? Yes -> go to branch B. Can you contain the affected system by disabling a flag or revoking a token? Yes -> do that first, then go to branch C." This is short, opinionated, and specific to the architecture. It tells the on-call what to do first, not what to think about first.
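
To make "a literal yes-or-no tree" concrete, here is a minimal sketch of the example above as data the on-call can walk. The questions and branch names come from the example. The `key` identifiers and the `FALLBACK` action are our assumptions for illustration, since the example does not say what happens when every answer is no.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    key: str       # short identifier for recording answers (hypothetical)
    question: str  # wording taken from the example above
    if_yes: str    # what the on-call does first when the answer is yes


# Order matters: the on-call walks top to bottom and stops at the first "yes".
TREE = [
    Branch("data_exposed", "Is customer data possibly exposed?", "go to branch A"),
    Branch("traffic_affected", "Is production traffic affected?", "go to branch B"),
    Branch(
        "can_contain",
        "Can you contain the affected system by disabling a flag or revoking a token?",
        "do that first, then go to branch C",
    ),
]

# Assumption: the example does not define the all-"no" case, so this is a placeholder.
FALLBACK = "page the security lead"


def first_action(answers: dict) -> str:
    """Return the first action for yes/no answers keyed by Branch.key."""
    for branch in TREE:
        if answers.get(branch.key, False):
            return branch.if_yes
    return FALLBACK
```

Keeping the tree as an ordered list rather than prose forces the author to decide which question wins when two answers are yes, which is exactly the decision the on-call should not be making at 3 AM.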

Section two: the paging list. With names, phone numbers, timezones, and the exact escalation rule. "For customer-data incidents, page the security lead first (Alex, +1-555-..., Pacific). If no response in 10 minutes, page the CTO (Dana, +1-555-..., Eastern). For production-down incidents, page the on-call engineer first, security lead second. For billing-system incidents, page the CFO first as a courtesy, then the security lead." This is the section that is always missing or stale, and it is the section that matters most at 3 AM.
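
The escalation rule is easier to keep current when it is written as data rather than a paragraph. The sketch below encodes the example above. The names, timezones and truncated phone numbers are the ones in the example and are left as-is. Two things are assumptions on our part: the dictionary keys are hypothetical, and the 10-minute wait, which the example states only for customer-data incidents, is applied to every chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    role: str
    name: str      # empty where the example gives no name
    phone: str     # left truncated, exactly as in the example
    timezone: str


CONTACTS = {
    "security_lead": Contact("security lead", "Alex", "+1-555-...", "Pacific"),
    "cto": Contact("CTO", "Dana", "+1-555-...", "Eastern"),
    "on_call": Contact("on-call engineer", "", "", ""),
    "cfo": Contact("CFO", "", "", ""),
}

# Who to page, in order, for each incident type.
ESCALATION = {
    "customer_data": ["security_lead", "cto"],
    "production_down": ["on_call", "security_lead"],
    "billing": ["cfo", "security_lead"],
}

# Assumption: the same wait applies to every chain, not only customer data.
WAIT_MINUTES = 10


def who_to_page(incident_type: str, minutes_without_response: int) -> Contact:
    """Return the contact to page after the given minutes of silence."""
    chain = ESCALATION[incident_type]
    # Move one step down the chain per elapsed wait, stopping at the last contact.
    step = min(minutes_without_response // WAIT_MINUTES, len(chain) - 1)
    return CONTACTS[chain[step]]
```

For example, `who_to_page("customer_data", 0)` returns Alex's entry and `who_to_page("customer_data", 10)` returns Dana's. The empty fields for the on-call engineer and the CFO are deliberate: a blank in the data is a visible gap, whereas a blank in a paragraph goes unnoticed.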

Section three: the evidence checklist. What to snapshot before changing anything. CloudTrail exports. Container images. Database snapshots. Log bucket objects with timestamps. This is the section that protects the post-incident review and any legal process that follows. It is also the section the on-call is most likely to skip under time pressure unless it is a checklist they can physically work through.
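
One way to make the checklist something the on-call physically works through is to record each capture as it happens. This is a minimal sketch, not a forensic tool: the four items are the ones listed above, while the class, its method names, and the idea of storing a free-text location are our assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The four items named in the checklist above.
EVIDENCE_ITEMS = (
    "CloudTrail exports",
    "Container images",
    "Database snapshots",
    "Log bucket objects with timestamps",
)


@dataclass
class EvidenceChecklist:
    # item -> (where the copy was stored, UTC time it was recorded)
    captured: dict = field(default_factory=dict)

    def record(self, item: str, location: str) -> None:
        """Tick off one item, noting where the snapshot went and when."""
        if item not in EVIDENCE_ITEMS:
            raise ValueError(f"not on the checklist: {item}")
        self.captured[item] = (location, datetime.now(timezone.utc))

    def missing(self) -> list:
        """Items not yet captured, in checklist order."""
        return [item for item in EVIDENCE_ITEMS if item not in self.captured]

    def complete(self) -> bool:
        return not self.missing()
```

The UTC timestamp on each entry is the part that pays off later: the post-incident review can reconstruct what was captured before any system was changed.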

Every other section of an IR runbook is useful, but if you write only those three sections well, you have a runbook that functions. If you write beautifully structured process narratives and those three are vague, you have a document for auditors.

3 Where Claude helps and where it does not

Claude produces an excellent first draft of the scaffolding. It produces a dangerous first draft of the load-bearing sections if the team does not push back.

The scaffolding - the definitions, the severity matrix, the communications template, the regulatory reporting section, the post-mortem structure - is well covered in training data. With about an hour of work, Claude will produce a competent draft that is accurate to industry norms.

The load-bearing sections are different. The decision tree has to reflect this specific architecture's containment options. The paging list has to reflect this specific team's operational reality. The evidence checklist has to reference this specific infrastructure's actual log locations. Claude does not know any of this. It will produce a plausible-looking version that is specific-feeling and confidently wrong in detail. The engineer who trusts that draft without editing is setting up the 3 AM on-call to paste a checklist of CloudTrail S3 paths that do not exist.

The practice we use with clients: Claude drafts the scaffolding unattended, and the three load-bearing sections are written by the security lead in a working session, with Claude assisting only as a formatter and a consistency checker. The session takes three to four hours. The output is a runbook that is both comprehensive (because Claude handled the scaffolding) and operational (because the human wrote the parts that need specific truth).

4 The dry-run that converts the document into a working runbook

The runbook does not become real until it has been used in a simulation at least once, by someone other than the person who wrote it.

The simulation is cheap: book ninety minutes, invent a scenario, and put one engineer who was not in the writing session in the "on-call" chair. They receive a fake page. Their only guidance is the runbook. They walk the decision tree, try to page the listed people (a dry-run page, with a shared understanding that recipients will respond with "acknowledged, dry-run"), and work through the evidence checklist.

The dry-run reveals the phone numbers that are out of date, the S3 paths that have changed, the decision-tree branches that lead to dead ends, the containment actions the on-call does not have permission to take. Every dry-run we run surfaces between four and ten changes to the document. That is the yield. The runbook that survives two or three dry-runs is a runbook that will hold up in a real incident.
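
One of those findings, the dead-end branch, has a structural form that can be checked mechanically if the decision tree is kept as data. The sketch below is an assumption about how such a tree might be stored (a dictionary mapping each node to its answers, with terminal actions as empty entries), and every node name in it is hypothetical. It catches only branches that point at something nobody wrote. It does not replace the dry-run, which is what finds the stale phone numbers and missing permissions.

```python
def dead_ends(tree: dict) -> list:
    """Return sorted (node, answer, target) triples whose target is undefined."""
    problems = []
    for node, branches in tree.items():
        for answer, target in branches.items():
            if target not in tree:
                problems.append((node, answer, target))
    return sorted(problems)


# Hypothetical tree: terminal actions are nodes with no outgoing answers.
EXAMPLE_TREE = {
    "data_exposed?": {"yes": "branch_A", "no": "traffic_affected?"},
    "traffic_affected?": {"yes": "branch_B", "no": "can_contain?"},
    "can_contain?": {"yes": "branch_C", "no": "page_security_lead"},
    "branch_A": {},
    "branch_B": {},
    # "branch_C" and "page_security_lead" were never written: two dead ends.
}

for node, answer, target in dead_ends(EXAMPLE_TREE):
    print(f"{node} --{answer}--> {target} is not defined")
```

Running a check like this before the session means the ninety minutes go to the findings only a human in the on-call chair can surface.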

We schedule the dry-run quarterly. It is short, it is cheap, and it is the only practice we have found that keeps the runbook current. The document on its own, regardless of how well it was written or with whom, is stale within two quarters. The dry-run keeps it honest.

If you do nothing else with this essay, run one dry-run of your current runbook this week. You will find out whether the document is operational or ceremonial in ninety minutes, and the finding is the starting point for the work.
