The old shape.
Three days. Day one: data-flow diagram on a whiteboard. Day two: STRIDE run across every flow and element, hand-capturing threats into a spreadsheet. Day three: mitigation mapping, prioritization, decisions.
The three-day workshop had a specific failure mode that everyone in the industry recognized and nobody quite solved. Day one was slow because getting the DFD right was genuinely hard collaborative work. Day two was tedious because STRIDE is a structured mechanical exercise and humans tire of it by hour three. Day three was rushed because the team was exhausted from day two and wanted to go home. The result was consistent: the first few trust boundaries were thoroughly covered, the middle was skimmed, and the last pieces got a quick "no threats identified, move on" treatment.
We ran threat models this way for years. They produced real value but the coverage was uneven, and the exhaustion factor meant that running a second pass a quarter later was a hard sell. What STRIDE-by-hand gave you was a snapshot that got stale, because the process was too painful to re-run against changes.
The new shape.
Three hours instead of three days. We split them into two distinct halves, and we put humans in the room only for the second half.
In the first ninety minutes, the security lead (sometimes solo, sometimes with one engineer) drives Claude through the modeling pass. This is a working session with a model, not a meeting. The lead drops the architecture diagram (Mermaid source or a description) into the context, along with the system's data types and trust boundaries. A structured prompt sequence, detailed below, walks Claude through STRIDE category by category, element by element. The output is a draft threat table: proposed threats, each tagged with category, element, and severity.
The first ninety minutes produces a long, messy draft. Claude is an aggressive threat-finder; it will surface thirty or forty threats for a medium-sized system, many of which are real and several of which are duplicates, misattributed, or simply not applicable. This is a feature, not a bug. The alternative, a human starting from a blank spreadsheet, produces a shorter, cleaner draft that also has lower coverage.
In the second ninety minutes, the humans gather. Now it is the team of three or four (security lead, platform engineer, maybe a product manager) reviewing the draft table. This is where having people in the room pays off: humans are much better at evaluating a proposed list than at generating one from scratch. They read through the thirty-odd threats. They merge duplicates. They mark "not applicable" where Claude misunderstood the architecture. They elevate severity where Claude was too timid. They add the three or four threats Claude missed because those require knowledge of this specific team's operational reality. In ninety minutes, a messy draft becomes a published threat model.
The prompt sequence we use.
We run the first ninety minutes as a three-prompt sequence. Each prompt has a specific role.
Prompt 1: DFD confirmation. The operator pastes the architecture description. The prompt says: "Produce a minimal Mermaid data-flow diagram. Label every element as process, data store, external entity, or data flow. Mark trust boundaries with the standard dotted-line convention. If anything is ambiguous, list your assumptions at the end."
The output is a Mermaid diagram plus a list of assumptions. The operator reads the assumptions, corrects any that are wrong, and the diagram becomes canonical for the rest of the session. This step alone catches the communication-gap errors that used to happen on workshop day one.
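For concreteness, here is the shape of the prompt 1 step as code. This is a minimal sketch assuming the Anthropic Python SDK; the model id, prompt wording, and function name are placeholders, not our production pipeline.

```python
# Minimal sketch of the prompt 1 step, assuming the Anthropic Python SDK.
# The model id and names are placeholders, not a production pipeline.
import anthropic

PROMPT_1 = """Produce a minimal Mermaid data-flow diagram for the architecture \
below. Label every element as process, data store, external entity, or data \
flow. Mark trust boundaries with the standard dotted-line convention. If \
anything is ambiguous, list your assumptions at the end.

{architecture}"""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def confirm_dfd(architecture_md: str) -> str:
    """Run prompt 1; returns Mermaid source plus the model's assumption list."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": PROMPT_1.format(architecture=architecture_md)}],
    )
    return response.content[0].text
```

Returning raw text is deliberate: the correction of the assumption list happens in conversation with the operator, not in code, and nothing downstream should consume an unconfirmed DFD.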
Prompt 2: STRIDE per element. The prompt takes the confirmed DFD and iterates: "For each element in this DFD, consider each STRIDE category. For every plausible threat, produce one table row with (id, element, category, threat_description, attacker_capability, severity_1_to_5, assumptions). Do not filter for 'likely'. Filter only for 'plausible against this architecture'. Aim for coverage."
The output is the messy draft table. Thirty to sixty rows for a typical Series A architecture. This is the bulk of the modeling work, and the "do not filter for likely" clause is load-bearing. Without it, Claude softens the list and you get the same uneven coverage as a tired human. With it, you get everything that is plausible and you filter during the human pass.
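A structured row shape makes the draft table something you can trim, diff, and publish rather than spreadsheet prose. A sketch mirroring the columns prompt 2 asks for; the field names and the `status` workflow are illustrative, not our actual schema.

```python
# One row of the draft threat table, mirroring the prompt 2 columns.
# Field names and the status workflow are illustrative.
from dataclasses import dataclass, field
from typing import Literal

StrideCategory = Literal[
    "spoofing", "tampering", "repudiation",
    "information_disclosure", "denial_of_service", "elevation_of_privilege",
]


@dataclass
class ThreatRow:
    id: str                    # stable id, e.g. "T-017", so PRs can link back
    element: str               # DFD element the threat applies to
    category: StrideCategory
    threat_description: str
    attacker_capability: str   # what the attacker must already have
    severity: int              # 1-5, recalibrated during the human pass
    assumptions: list[str] = field(default_factory=list)
    status: str = "draft"      # draft -> confirmed / merged / not_applicable
```

The `status` field is there for the second ninety minutes: merging duplicates and marking "not applicable" become row updates instead of spreadsheet surgery.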
Prompt 3: mitigation mapping. Once the human pass has trimmed the draft to the real threats, a third prompt runs: "For each remaining threat, propose two candidate mitigations with approximate implementation effort (1-5), and note which existing controls, if any, already partially mitigate it. Mark each mitigation as preventive, detective, or responsive."
This produces the sprint-planning material. The human pass validates, but the mitigation menu is something Claude does well because there is a stable literature of mitigations per STRIDE category.
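A sketch of the mitigation shape and one possible sprint ordering, highest severity first and cheapest mitigation first; the ordering heuristic is illustrative, not a claim about the optimal rule.

```python
# Mitigation shape per prompt 3, plus an illustrative sprint ordering.
from dataclasses import dataclass
from typing import Literal


@dataclass
class Mitigation:
    threat_id: str
    description: str
    effort: int  # 1-5 approximate implementation effort
    kind: Literal["preventive", "detective", "responsive"]
    existing_partial_controls: list[str]


def sprint_order(mitigations: list[Mitigation],
                 severity_by_threat: dict[str, int]) -> list[Mitigation]:
    """Order candidates: severity descending, then effort ascending."""
    return sorted(mitigations,
                  key=lambda m: (-severity_by_threat[m.threat_id], m.effort))
```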
Where the human has to overrule.
Four places, consistently.
1. Severity calibration. Claude tends to rate spoofing and tampering threats as severity 3 or 4 by default. Many are actually severity 2 in the presence of existing controls (TLS everywhere, SSO federation). The human with actual knowledge of the deployed controls downgrades correctly. Claude does not have a reliable view of operational reality.
2. Shared-blind-spot threats. When the architecture uses a specific vendor or service pattern that is not well covered in training data (a newer cloud-native product, a novel internal pattern), Claude will miss threats that a human with deployment experience sees immediately. The fix is that the human pass explicitly asks: given the specifics of our stack, what threats would a generic checklist miss?
3. Aggregation and attack paths. STRIDE enumerates threats per element. It does not chain them. The human pass has to look for combinations: threat A in element X becomes severity 5 if it can be combined with threat B in element Y. Claude is mediocre at this because the structured enumeration never surfaces the chain. The human walks the DFD mentally and asks what the worst end-to-end path is; a toy pre-pass that lists candidate chains for the room is sketched after this list.
4. Organizational threats. Employee turnover, on-call gaps, approval-policy workarounds, vendor-relationship drift. None of these are in STRIDE, but they affect the threat model in practice. Claude does not see these. The human has to add them explicitly.
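The chaining gap in point 3 can at least be seeded mechanically, even though the judgment stays human. A toy sketch: treat the DFD as a directed graph of flows and list every pair of threats where one sits immediately upstream of the other. The output is candidates for the room, nothing more; all names here are hypothetical.

```python
# Toy chaining pre-pass: list threat pairs on adjacent DFD elements.
# It only surfaces candidates; deciding whether a chain is real stays human.
from collections import defaultdict


def chain_candidates(flows: list[tuple[str, str]],
                     threats_by_element: dict[str, list[str]]
                     ) -> list[tuple[str, str]]:
    downstream = defaultdict(list)
    for src, dst in flows:
        downstream[src].append(dst)
    pairs = []
    for src, dsts in downstream.items():
        for dst in dsts:
            for upstream_threat in threats_by_element.get(src, []):
                for downstream_threat in threats_by_element.get(dst, []):
                    pairs.append((upstream_threat, downstream_threat))
    return pairs
```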
Output: the deliverables the client keeps.
After the three hours, the client keeps three things. Each is generated by the pipeline, but each is signed off by a human before shipping.
First, the DFD with trust boundaries, as Mermaid source in the handbook. Because it is source, it can be regenerated and diffed against the architecture on the next quarterly pass.
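Because the DFD is text, the quarterly diff needs nothing beyond the standard library. A minimal sketch; the file names are placeholders.

```python
# Diff last quarter's DFD source against the regenerated one.
import difflib


def dfd_diff(previous_mermaid: str, regenerated_mermaid: str) -> str:
    return "\n".join(difflib.unified_diff(
        previous_mermaid.splitlines(),
        regenerated_mermaid.splitlines(),
        fromfile="dfd_previous.mmd",  # placeholder handbook paths
        tofile="dfd_current.mmd",
        lineterm="",
    ))
```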
Second, the trimmed threat table: fifteen to twenty-five real threats, each with severity, attacker capability, applicable controls, and a mitigation-sprint mapping. Published in the handbook under a stable section so it can be revisited and updated.
Third, the sprint list: the prioritized set of mitigations that come out of the session, tied to threat IDs so that when an engineer pushes a fix, they can link the PR back to the specific threat it closes. This is the artifact that actually drives remediation work. It is also the artifact that makes the quarterly re-run valuable: we re-run the pipeline, diff the threat table, and see which threats are now closed and which new ones have appeared.
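The threat-table diff is just as mechanical once rows carry stable ids. A sketch reusing the `ThreatRow` shape from earlier; the report keys are illustrative.

```python
# Quarterly diff over the threat table, keyed on stable threat ids.
def diff_threat_tables(previous: dict[str, "ThreatRow"],
                       current: dict[str, "ThreatRow"]) -> dict[str, list[str]]:
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "closed": sorted(prev_ids - curr_ids),   # mitigated or retired
        "new": sorted(curr_ids - prev_ids),      # arrived with arch changes
        "carried": sorted(prev_ids & curr_ids),  # still open; re-check severity
    }
```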
The quarterly cadence is the thing that the old three-day-workshop shape did not permit. When the cost drops from three days to three hours, running the exercise quarterly becomes feasible. That is where the real compounding value shows up: not in the first workshop, but in the fifth one, when the team has a four-quarter view of how the threat model has evolved with the architecture.