E14 Essay · Applied AI

What tool use actually unlocks.

Tool use is the feature that turns Claude from an essay writer into a collaborator that can touch real systems. Here is what it unlocks, what it does not, and the design discipline that keeps tool-using agents from becoming a liability.

Applied AI · 12 min read · 2026-04-19 · by the operator, drafting assisted by Claude
Corrections log: none yet. If you find a factual error, email hello@nexcur.ai and we will log it here, dated.
1 What changes when you give a model hands.

Without tool use, Claude can reason about a system. With tool use, Claude can query the system, modify it, and respond to what it finds. The difference in what you can build is much larger than the one-sentence description makes it sound.

Without tools: "Here is what a good backup strategy looks like for your database." With tools: "Your backups are running. The most recent one succeeded 4 hours ago. Your retention policy is 30 days. You have 14 days of successful backups. Your last restore test was 6 months ago. Do you want me to kick off a restore-to-test now, or open a ticket for you to schedule one?"

Tool use closes the gap between advice and execution. It turns "here is what you should do" into "here is what is actually happening." That gap is where most of a consultant's delivery value lives. Closing it is the single largest productivity lever in our workflow.

It is also the single largest source of new risk, because a model that can do things can also do things wrong. The rest of this essay is about how to get the leverage without the liability.

2 Four shapes of tool use that pay off.

Not every task benefits from tool use. The ones where we have measured a real lift fall into four shapes.

Shape 1: read-only inspection.

Give the model read access to inventories, logs, configs, dashboards. Let it gather context before it answers. The inspection tools return facts; the model synthesizes them into an answer. No mutation. This is the lowest-risk and highest-leverage shape of tool use, and where we start on every engagement.

Example: a security engagement where Claude has read access to the client's IAM inventory via a scoped AWS API role, to their Terraform repo via a read-only GitHub app, and to their security-team Slack channel (history only). The agent can answer "what roles currently have Administrator privileges and when were they last used" in 30 seconds, with cited evidence, instead of the engineer spending two hours pulling the same data by hand.

Shape 2: structured evidence capture.

Instead of having the model return a free-form answer, build tools that let it produce structured evidence: filling a finding template, attaching a screenshot to a ticket, writing a JSON report into a shared document. The tool enforces the shape. The model's role is to gather the data and call the tool with the correct fields.

This is particularly useful for pentest reporting, incident timelines, and compliance artifact generation. The tool is the shape the artifact must take; the model is the operator that populates it.
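As a concrete sketch, here is what an evidence-capture tool might look like as an Anthropic-style tool definition. The tool name, fields, and description are hypothetical, not a real client artifact; the point is that the schema, not the prose, enforces what a complete finding contains.

```python
# Hypothetical tool definition: the input_schema enforces the finding's
# shape, so the model can only submit a complete, well-typed artifact.
record_finding = {
    "name": "record_finding",
    "description": "Write one pentest finding into the engagement report. "
                   "Appends to the shared report document; never overwrites.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {
                "type": "string",
                "description": "One-line summary, e.g. 'S3 bucket world-readable'",
            },
            "severity": {
                "type": "string",
                "enum": ["critical", "high", "medium", "low", "info"],
            },
            "affected_asset": {
                "type": "string",
                "description": "Hostname, ARN, or repo path",
            },
            "evidence": {
                "type": "string",
                "description": "Command output or config excerpt proving the finding",
            },
            "remediation": {"type": "string"},
        },
        "required": ["title", "severity", "affected_asset", "evidence"],
    },
}
```

Because `severity` is an enum and `evidence` is required, a finding without proof or with a made-up severity level fails schema validation before it ever reaches the report.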

Shape 3: reversible automations.

Tools that make changes, but changes that can be trivially reverted: creating a draft PR, opening a ticket, staging a configuration file, posting to a proposal-channel instead of the final channel. The model can iterate on the change; a human approves the irreversible step (merging, deploying, sending).

This is where most of the "agentic" pipelines we actually ship live. The agent proposes; the human disposes.

Shape 4: human-in-the-loop escalation.

Tools specifically designed to pause execution and ask a human. "I have three candidate fixes; which should I apply?" "The diff I'm about to push changes a security-sensitive config; please review." "This takes longer than 90 seconds; I'm going to stop and hand back to you."

This is the shape that most teams underbuild. They write tools that do things autonomously, then regret it the first time the model does the wrong thing. An escalation tool is a first-class citizen, not a fallback.
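One way to make escalation first-class is to give the model an explicit tool for it. The definition below is a hypothetical sketch, not a standard API: the description tells the model when pausing is the correct move, so asking a human becomes a deliberate action rather than an error path.

```python
# Hypothetical escalation tool: pausing to ask a human is a first-class
# action the model can choose, not a fallback when something breaks.
ask_human = {
    "name": "ask_human",
    "description": "Pause and ask the operator a question. Use this when you "
                   "have multiple candidate actions, when a change touches "
                   "security-sensitive config, or when a step would exceed "
                   "the session budget. Execution halts until they answer.",
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "options": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Candidate actions, if you have specific ones",
            },
            "blocking": {
                "type": "boolean",
                "description": "True if the agent cannot proceed without an answer",
            },
        },
        "required": ["question"],
    },
}
```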

3 Tool design: treat each tool as a public API.

The tool description is the contract between the model and your system. Write it as if a careful junior engineer is going to be the only user of the tool, and you are their team lead writing the docs.

Clear scope.

Each tool does one thing. "Get user by ID" is a tool. "Get user by ID and also update their email and also archive them" is three tools. Breaking compound operations into primitives gives the model much better tool-selection behavior and makes testing simpler.
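A sketch of that split, using the user example from above. The tool names and fields are illustrative; what matters is that each tool maps to exactly one intent, so the model's tool choice is unambiguous and each tool can be tested in isolation.

```python
# Hypothetical split: one compound operation becomes three single-purpose
# tools. Each does one thing, so tool choice maps 1:1 to intent.
def make_tool(name, description, properties, required):
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }

user_id = {"user_id": {"type": "string", "description": "Opaque user ID, e.g. 'u_8f2c'"}}

tools = [
    make_tool("get_user", "Read-only: fetch a user record by ID.",
              user_id, ["user_id"]),
    make_tool("update_user_email",
              "Mutating: set a user's email address. Writes to the audit log.",
              {**user_id, "email": {"type": "string", "format": "email"}},
              ["user_id", "email"]),
    make_tool("archive_user",
              "Mutating, reversible: archive a user. Undo with unarchive_user.",
              user_id, ["user_id"]),
]
```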

Precise parameter types.

JSON Schema for every parameter. Enums for string fields with fixed vocabularies. Descriptions that explain what the field is and give examples. The model's tool-calling quality is proportional to how specific your schema is.
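The same parameter, specified two ways, makes the point. Both snippets are hypothetical; the second is the shape we would actually ship, because the enum closes the vocabulary and the description tells the model what each value means.

```python
# Vague vs specific: the same parameter, two ways. Tool-calling accuracy
# tracks the specificity of the schema the model is given.
vague = {"status": {"type": "string"}}

specific = {
    "status": {
        "type": "string",
        "enum": ["open", "triaged", "in_progress", "resolved", "wont_fix"],
        "description": "Ticket workflow state. 'triaged' means an owner is "
                       "assigned; 'resolved' means the fix is deployed.",
    }
}
```

With `vague`, the model invents values like "done" or "closed" that your system has never heard of. With `specific`, an out-of-vocabulary value is a schema validation error, caught before it touches anything.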

Honest descriptions of side effects.

If the tool has side effects, say so explicitly in the description. "This writes to the audit log. This sends an email. This costs $0.12 per call." When the model knows the consequences, it calls the tool more judiciously, and you get better human-readable justifications for why it chose to call it.

Explicit failure shapes.

Tools can fail. Describe how. "Returns an error with code 404 if the user does not exist; code 403 if the caller lacks permission; retry-safe on 5xx." The model can plan around failures it knows about and cannot plan around failures it is surprised by.
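On the harness side, documented failure shapes let you write a generic policy instead of per-tool error handling. A minimal sketch, assuming tools return a dict with an optional `error_code` field mirroring the hypothetical description above: 404 and 403 are terminal (retrying cannot help, so the error is surfaced to the model), while 5xx is retry-safe with backoff.

```python
import time

# Sketch of a caller that plans around documented failure shapes.
# 404/403 are terminal; 5xx is retry-safe per the tool's own contract.
def call_with_failure_policy(tool_fn, args, max_retries=3, base_delay=0.05):
    for attempt in range(max_retries + 1):
        result = tool_fn(**args)
        code = result.get("error_code")
        if code is None:
            return result                          # success
        if code in (403, 404):
            return result                          # terminal: surface to the model
        if 500 <= code < 600 and attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))  # retry-safe: back off, retry
            continue
        return result                              # retries exhausted or unknown code
```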

Rate limits and quotas.

If the underlying API has limits, surface them in the tool description. Do not let the model try to call your production API 80 times per second without knowing it will be rate-limited.

4 The safety rails we put around every tool-using flow.

Every agent we ship has the same three safety rails, applied uniformly regardless of how low-risk the task looks or how capable the model is.

Rail 1: per-session cost and call budget.

Every agent session gets a budget. Maximum API cost. Maximum tool calls. Maximum wall-clock runtime. If any are exceeded, the agent halts and escalates. This catches both infinite loops and cost runaways before they become incidents.
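Rail 1 can be a few dozen lines of harness code. A minimal sketch, with illustrative limits rather than our real numbers: the agent loop calls `charge` after each step and halts on the first non-None result from `exceeded`.

```python
import time

# Sketch of Rail 1: a per-session budget. The agent loop charges this
# after each step and halts when any limit is exceeded.
class SessionBudget:
    def __init__(self, max_cost_usd=5.00, max_tool_calls=50, max_runtime_s=300):
        self.max_cost_usd = max_cost_usd
        self.max_tool_calls = max_tool_calls
        self.max_runtime_s = max_runtime_s
        self.cost_usd = 0.0
        self.tool_calls = 0
        self.started = time.monotonic()

    def charge(self, cost_usd=0.0, tool_calls=0):
        self.cost_usd += cost_usd
        self.tool_calls += tool_calls

    def exceeded(self):
        """Return the name of the first exceeded limit, or None."""
        if self.cost_usd > self.max_cost_usd:
            return "cost"
        if self.tool_calls > self.max_tool_calls:
            return "tool_calls"
        if time.monotonic() - self.started > self.max_runtime_s:
            return "runtime"
        return None
```

Returning the name of the exceeded limit (rather than a bare boolean) matters: it is what the agent reports when it escalates, and what lands in the audit log.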

Rail 2: destructive-action approval.

Any tool that modifies state requires explicit approval before the first invocation in a session, or the first invocation with a given set of parameters. "You want to delete this user? Confirm." This is annoying when the approver is sitting at a keyboard watching the agent. It is essential when the agent runs unattended.
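A sketch of Rail 2 as an approval gate the harness consults before dispatching any mutating tool. The `ask_operator` callback is a stand-in for however approval actually reaches a human (Slack prompt, CLI confirm, ticket); approvals are cached per tool-and-parameter combination so the human is not re-asked for an identical call within the session.

```python
# Sketch of Rail 2: mutating tools need explicit approval on their first
# invocation in a session, and again for each new parameter set.
class ApprovalGate:
    def __init__(self, ask_operator):
        self.ask_operator = ask_operator   # callable: (tool_name, args) -> bool
        self.approved = set()

    def check(self, tool_name, args):
        key = (tool_name, tuple(sorted(args.items())))
        if key in self.approved:
            return True                    # already approved this exact call
        if self.ask_operator(tool_name, args):
            self.approved.add(key)
            return True
        return False                       # denied: the harness drops the call
```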

Rail 3: full audit log, retained for 90 days.

Every tool call, every input, every output, every approval. Searchable. Tied to a session ID that the user can reference. When something goes wrong (or, more importantly, when a client asks "why did you do that"), we can reconstruct the entire agent session. This is the gate between the agent being a productivity tool and a compliance problem.
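The simplest durable form of Rail 3 is an append-only JSON-lines log keyed by session ID. A minimal sketch, with hypothetical event names; in production these lines would go to whatever log store you retain for 90 days, but the structure is the point.

```python
import json
import time

# Sketch of Rail 3: every tool call, input, output, and approval becomes
# one JSON line, keyed by a session ID the client can quote back to us.
def audit_entry(session_id, event, payload):
    return json.dumps({
        "ts": time.time(),
        "session_id": session_id,
        "event": event,            # e.g. "tool_call", "tool_result", "approval"
        "payload": payload,
    }, sort_keys=True)

def log_tool_call(log, session_id, tool_name, args, result):
    log.append(audit_entry(session_id, "tool_call",
                           {"tool": tool_name, "args": args}))
    log.append(audit_entry(session_id, "tool_result",
                           {"tool": tool_name, "result": result}))
```

Because every line is self-describing JSON with a session ID, "why did you do that" reduces to a grep over the log followed by reading the session in order.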

5 The liability we refuse to take on.

There are four categories of tool use we will not build for a client. Not "will not build carefully." Will not build at all. These are the lines we hold regardless of how compelling the business case looks.

Autonomous money-movement.

Tools that transfer funds, issue refunds, modify billing, or otherwise move real dollars without a human in the loop. Every version of this we have seen pitched has failed in review. The risk of a prompt-injection attack causing unauthorized transfers is real, and the asymmetric cost of being wrong (a single breached agent could lose six figures) exceeds the productivity gain. A human approves the money move. Always.

Autonomous production database writes.

Schema migrations, user-record updates, data deletion. The blast radius of a wrong action here is too large and too permanent. The agent can draft the migration and write it to a staging environment. A human promotes.

Autonomous external communication on behalf of the client.

Sending email to customers, posting on behalf of the company brand, engaging with journalists. Too easy to produce a message that sounds right and lands wrong. The agent drafts. A human sends.

Autonomous credential or permissions management.

Creating IAM users, rotating secrets without supervision, granting admin. The security-blast-radius math is the same as money movement: the downside is unbounded, the upside is "saves a human ten minutes." Not worth it.

The pattern in all four: high-stakes irreversible actions stay in human hands. The agent prepares, proposes, and drafts. The human ships. This is the design principle that lets us ship real tool-using agents without putting clients or ourselves at catastrophic risk.

Tool use is the feature that makes Claude actually useful in production. It is also the feature that will produce the first generation of high-profile AI incidents. The firms that figure out the design discipline in the next 12 months will deliver much larger productivity gains than the ones who either stay defensive or go too fast. Pick your rail carefully.

Related work
Guide G5.1
Shipping AI features safely →
The full review checklist we apply before a tool-using agent reaches production.
Guide G8.2
Working with the Claude API →
Tool-use plumbing, orchestration patterns, and observability.
Service
Product development →
How we build tool-using agents for client products, with the safety rails in this essay.
Subscribe

Operator-tone writing on Applied AI, Security, SEO, and Economics.

One essay per week. No hype. No tracking pixels. Unsubscribe in one click.