E9 Essay · Security

Claude is not a pentester. A pentester with Claude is.

We use Claude on every engagement. We also refuse to ship a finding Claude generated without a human operator validating it. Here is where that line sits, on both sides, and why vendors who hide the line are selling marketing copy as security.

Security · 10 min read · 2026-02-07 · drafted by the operator, assisted by Claude
Corrections log: none yet. If you find a factual error, email hello@nexcur.ai and we will log it here, dated.
1 The marketing claim and what it hides.


"AI-powered penetration testing" is showing up on the landing pages of at least a dozen vendors we have watched emerge over the last year. The pitch is the same every time: faster, cheaper, more coverage, same quality. Sometimes with a badge that says the report was "validated by a CISSP."

We have read some of those reports. They are recognizable within two pages. Findings are restated with confident language but thin root-cause analysis. Proofs-of-concept link to public CVE pages instead of demonstrating exploitability in the client's actual environment. Severity ratings are clustered at High and Critical in suspicious proportions. The remediation section reads like Stack Overflow on a calm day. The CVSS vectors do not match the described attack path.

The product is not a pentest. The product is a vulnerability scan, dressed up by a language model into a shape that looks like a pentest report, sold at pentest prices. The client cannot tell from the outside. They pay, they file it with compliance, and a year later something gets breached that was listed as "informational" because nobody looked at it.

We are going to ship Claude on every engagement. We are also going to be specific about what that means, because the alternative is becoming the vendor we are warning you about.

2 Where Claude is genuinely strong.


The work that used to eat 30 to 40 percent of a pentester's week is the work Claude does well. That is the honest gain.

Reconnaissance synthesis.

Pulling together subdomain enumeration, exposed-asset inventory, tech fingerprints, historical DNS, and public breach data into a coherent picture. The operator used to spend two days on this at the start of every engagement. Claude produces a first pass in 30 minutes that is 80 percent of what the operator would have written, with citations. The operator spends the reclaimed time actually testing the asset surface instead of cataloging it.
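The synthesis step is mechanically simple, which is exactly why it is safe to delegate. A minimal sketch of the fold, with entirely hypothetical hostnames and source data standing in for the real tool outputs:

```python
from collections import defaultdict

# Hypothetical recon outputs. In practice these come from subdomain
# enumerators, fingerprinting tools, and breach-data lookups.
subdomains = ["api.example.com", "staging.example.com"]
fingerprints = {"api.example.com": "nginx/1.24",
                "staging.example.com": "Apache/2.4"}
breach_hits = {"staging.example.com": ["2023 credential dump"]}

def merge_inventory(subdomains, fingerprints, breach_hits):
    """Fold separate recon outputs into one record per asset."""
    inventory = defaultdict(dict)
    for host in subdomains:
        inventory[host]["tech"] = fingerprints.get(host, "unknown")
        inventory[host]["breach_history"] = breach_hits.get(host, [])
    return dict(inventory)

for host, record in merge_inventory(subdomains, fingerprints, breach_hits).items():
    print(host, record)
```

The value of the first pass is not the merge itself but the citations and the prose around it; the merge is just the part that no longer costs two operator-days.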

Attack-path reasoning over inventory.

Given a list of identities, roles, services, and trust relationships, enumerate the plausible paths from low-privilege foothold to crown-jewel data. Claude is excellent at this, and specifically at not missing the third-order path a human might skip because it is boring. Pairs well with extended thinking on Opus: see our essay on where that budget is worth spending.
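The underlying task is simple-path enumeration over a directed trust graph, which is why the model is reliable at it: exhaustiveness is the whole job. A sketch with a hypothetical inventory (the role names and edges are illustrative, not from any engagement):

```python
def attack_paths(trust, start, target, path=None):
    """Yield every simple path from a foothold identity to the
    crown-jewel asset over a directed trust/privilege graph."""
    path = (path or []) + [start]
    if start == target:
        yield path
        return
    for nxt in trust.get(start, []):
        if nxt not in path:  # simple paths only: no revisits
            yield from attack_paths(trust, nxt, target, path)

# Edges mean "can assume / can access". Hypothetical inventory.
trust = {
    "intern-role": ["ci-runner", "readonly-db"],
    "ci-runner":   ["deploy-role"],
    "deploy-role": ["prod-db"],
    "readonly-db": ["prod-db"],  # the boring path a human skims past
}
for p in attack_paths(trust, "intern-role", "prod-db"):
    print(" -> ".join(p))
```

Both paths come out, including the unglamorous one through the read replica. The human still decides which paths are exploitable; the model's job is to make sure none are missing from the list.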

Report structure and voice.

Taking rough operator notes and turning them into a finding with a clean context, a precise reproduction sequence, an evidence block, a risk rating with justification, and a remediation path. The operator still writes the claim and picks the severity. Claude handles the form, including the boring but important glue text that makes a 60-page report readable instead of a wall of raw output.

Payload and config review.

Reading through IAM policies, Terraform, Dockerfiles, nginx configs, and CI pipelines to flag misconfigurations. Claude catches roughly the same set of misconfigurations a senior operator would, at roughly 10x the speed. The savings go into the findings that actually require exploitation creativity.
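To make the category concrete, here is the kind of mechanical check this review automates, sketched for one common IAM anti-pattern. The function and policy are illustrative, not our tooling; real review covers far more patterns than wildcard grants:

```python
import json

def flag_wildcard_allows(policy_json):
    """Flag IAM statements that pair Effect: Allow with a wildcard
    Action or Resource, a common over-permissive misconfiguration."""
    findings = []
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be unwrapped
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions or "*" in resources:
            findings.append(f"statement {i}: wildcard in Allow statement")
    return findings

policy = '{"Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}'
print(flag_wildcard_allows(policy))
```

The model's edge is not that the check is clever. It is that it applies hundreds of checks like this one across a whole repository without getting bored on file forty.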

3 Where Claude is dangerous without a human.


These are the places we do not let Claude's output reach a client unreviewed. Not because the model is bad at them, but because the failure mode is "confident and wrong" and the cost of confident-and-wrong in security is a breach.

Exploitability determination.

Whether a theoretical vulnerability is actually exploitable in the client's environment, given their specific network posture, authentication, WAF rules, and compensating controls. Claude will produce a plausible narrative of exploitability. Sometimes it is right. Sometimes it misses a compensating control that makes the exploit path a dead end. An operator has to check, hands on keyboard, before we call something High.

Proof-of-concept execution.

We do not let Claude run the actual exploit against the production target. This is a hard rule, not a preference. The operator performs the PoC, captures the evidence, and Claude helps write up what happened after. Reasons: legal authorization is tied to the named human in the rules of engagement. Kill-switch judgement is tied to the same human. Claude does not have the legal authority to press the exploit button on someone else's systems, and we do not want to be the firm that argues otherwise.

Severity calibration.

CVSS is not a formula you can hand off. It is a judgement call about the client's specific context, their threat model, and the way the finding chains into other findings. Claude's default calibration is systematically biased toward higher severity, because the training distribution rewards concerned-sounding security content. The operator sets severity. Claude provides the rationale template.
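To be precise about what is and is not a formula here: the CVSS v3.1 base score itself is mechanical and could be handed off. A sketch of the Scope: Unchanged case, with weights from the v3.1 specification (Roundup is simplified to a plain ceiling, which the spec defines more carefully to avoid float artifacts):

```python
import math

# CVSS v3.1 metric weights (Scope: Unchanged), per the specification.
AV  = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}
AC  = {"L": 0.77, "H": 0.44}
PR  = {"N": 0.85, "L": 0.62, "H": 0.27}
UI  = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}

def base_score(av, ac, pr, ui, c, i, a):
    """CVSS v3.1 base score, Scope Unchanged. Simplified rounding."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return math.ceil(min(impact + exploitability, 10.0) * 10) / 10

# AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H scores 9.8 on paper.
print(base_score("N", "L", "N", "N", "H", "H", "H"))  # -> 9.8
```

The judgement call is everything that feeds the vector: whether PR is really "None" behind the client's SSO, whether the finding chains with another one, whether a compensating control changes AC. That is what the operator sets, and what the model's concerned-sounding defaults inflate.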

Novel-finding framing.

If something truly new turns up in an engagement, a class of bug we have not seen before, the framing has to come from the human who found it. Claude is a remix engine over prior art. Novel bugs get flattened into the nearest familiar shape, which loses the thing that made them novel. The operator writes the first draft of novel findings. Claude polishes afterward.

4 The validation discipline we refuse to skip.


Every finding in a NexcurAI report passes through a three-step gate before it reaches the client. No exceptions. No "it was obviously right, so we saved time on the review."

Gate 1: operator reproduces.

The named engagement operator reproduces the finding in the client's environment and records evidence (screenshots, request/response pairs, logs). If the operator cannot reproduce, the finding is marked as "unable to confirm" and either dropped or investigated further. It is not shipped as a finding on Claude's say-so.

Gate 2: severity review.

A second operator reviews severity and CVSS. The review explicitly checks whether the model's default rating survives human context. About 15 percent of draft findings get a severity adjustment at this gate. It is never automated.

Gate 3: named reviewer.

The report lists the named reviewer and the review date. This is a contractual commitment, captured in our rules of engagement. If a finding in the report was never looked at by a named human, we are lying on the cover page. That is the line we will not cross.
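The three gates reduce to a shippability invariant on every finding record. A minimal sketch of that invariant; the field names are illustrative, not our actual report schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    """Minimal shape of the three-gate discipline. Illustrative only."""
    title: str
    draft_severity: str
    reproduced_by: Optional[str] = None         # Gate 1: named operator
    severity_reviewed_by: Optional[str] = None  # Gate 2: second operator
    final_severity: Optional[str] = None
    named_reviewer: Optional[str] = None        # Gate 3: on the cover page

    def shippable(self) -> bool:
        """A finding ships only if every gate has a named human on it."""
        return all((self.reproduced_by, self.severity_reviewed_by,
                    self.final_severity, self.named_reviewer))

f = Finding("SSRF in image proxy", draft_severity="High")
print(f.shippable())              # a draft alone never ships
f.reproduced_by = "operator-1"
f.severity_reviewed_by = "operator-2"
f.final_severity = "Medium"       # human context adjusted the draft rating
f.named_reviewer = "operator-2"
print(f.shippable())
```

The point of writing it this way is that "we reviewed it" is not a vibe; it is a set of named fields that are either filled in or not.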

5 What we tell procurement when they ask if it is "AI-driven."


The answer we give, and that we train clients to give their own procurement teams, is this:

Every finding in this report was produced by a human operator working alongside Claude. The operator is named on the cover page and is legally responsible for the conclusions. Claude accelerated reconnaissance, inventory synthesis, report structure, and configuration review. Claude did not execute exploits, did not set severity, and did not sign the report.

That paragraph appears in every statement of work. It is repeated in the handbook we give the client. It is repeated on the rules-of-engagement page. It is the language that distinguishes a pentest from a marketing artifact.

If a vendor cannot tell you which parts the model did and which parts a named human did, they are either hiding a scanner behind a prompt or have not thought carefully about where the line should be. Ask for both lists. The better answer is the specific one.

Claude is not a pentester. It is the best collaborator a pentester has ever had. Firms that understand the difference will deliver better reports at lower cost. Firms that do not will get caught by the first breach that reveals what was actually in the box.

Related work
Sample S3
Pentest report example →
The shape of a report that shows its work. Bad-vs-good pairs, validated PoCs, named reviewer.
Essay E2
The pentest report as literary form →
Taxonomy of bad pentest reports and what Claude specifically changes about the form.
Service
Cybersecurity and pentesting →
How we structure engagements around the human/Claude split this essay describes.