Prompt engineering for ops
For engineers who are past the “hello world” tutorial and ready to ship. API fundamentals, caching, tool use, batch, observability, evals, and deployment - the pieces you actually need to build a reliable production pipeline.
Unless you have a specific reason (an unsupported language, an exotic deployment), use Anthropic's official SDK. TypeScript and Python both have strong support. Using the SDK buys you retries, streaming, type safety, and version stability for free.
// TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  system: "You are a senior security researcher.",
  messages: [
    { role: "user", content: "Summarize OWASP LLM Top 10 in 5 bullets." },
  ],
});
That is the minimum viable call. Everything else builds on this shape.
Claude offers a family of models with different capability and cost profiles. Rough mental model: Haiku is fast and cheap, suited to classification and simple transforms; Sonnet is the workhorse that handles most production tasks well; Opus is the most capable and the most expensive, reserved for the hardest reasoning.
A common architecture is a router: Haiku classifies the incoming request, then routes to Sonnet for most work and Opus for specific hard subtasks. This often cuts cost by 60 to 80 percent vs sending everything to Opus.
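A minimal sketch of that router, reusing the model IDs from this guide plus an assumed Haiku ID; the labels and routing table are illustrative, not a fixed recipe:

// Hypothetical router: Haiku labels the request, the label picks the model.
const ROUTES: Record<string, string> = {
  simple: "claude-haiku-4-5", // assumed model ID; use your account's Haiku tier
  standard: "claude-sonnet-4-6",
  hard: "claude-opus-4-7",
};

async function pickModel(query: string): Promise<string> {
  const triage = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 10,
    system: "Classify the request difficulty. Reply with exactly one word: simple, standard, or hard.",
    messages: [{ role: "user", content: query }],
  });
  const label =
    triage.content[0]?.type === "text"
      ? triage.content[0].text.trim().toLowerCase()
      : "standard";
  return ROUTES[label] ?? ROUTES.standard; // unknown label: default to the workhorse
}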
Prompt caching lets you reuse static parts of a prompt across calls. System prompts, tool definitions, long context, and few-shot examples are ideal candidates.
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longStaticSystemPrompt,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    { role: "user", content: userQuery },
  ],
});
Behavior: the first call with a cache_control breakpoint writes the cache at a surcharge (roughly 25 percent over base input pricing); subsequent calls that match the cached prefix read those tokens at roughly a tenth of the base input price; the ephemeral cache lives about five minutes, refreshed each time it is hit.
Put static content as early in the prompt as possible. Put dynamic content (user messages, retrieved documents) after the cached content. The cost model is in the cost-modeling guide.
Claude can call tools (functions) you define. This is how you give it the ability to read databases, hit APIs, write files, or trigger actions.
const tools = [
  {
    name: "get_customer",
    description: "Look up a customer by ID. Returns name, email, plan.",
    input_schema: {
      type: "object",
      properties: {
        customer_id: { type: "string" },
      },
      required: ["customer_id"],
    },
  },
];

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools,
  messages: [
    { role: "user", content: "What plan is customer C-4821 on?" },
  ],
});
// If response.stop_reason === "tool_use":
// 1. Extract the tool call from response.content
// 2. Execute your tool
// 3. Send the result back in the next turn
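Spelled out, that round trip looks roughly like this; getCustomer is a hypothetical stand-in for your real tool implementation:

if (response.stop_reason === "tool_use") {
  const toolUse = response.content.find((b) => b.type === "tool_use");
  if (toolUse?.type === "tool_use") {
    // Execute your tool (getCustomer stands in for your implementation).
    const { customer_id } = toolUse.input as { customer_id: string };
    const result = await getCustomer(customer_id);

    // Send the result back; Claude answers with the data in hand.
    const followUp = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      tools,
      messages: [
        { role: "user", content: "What plan is customer C-4821 on?" },
        { role: "assistant", content: response.content },
        {
          role: "user",
          content: [
            {
              type: "tool_result",
              tool_use_id: toolUse.id,
              content: JSON.stringify(result),
            },
          ],
        },
      ],
    });
  }
}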
Tool use is powerful and easy to misuse. Best practices: give each tool one narrow, well-described job; several specific tools beat a single generic do_thing tool with 12 optional parameters.

Anytime a human is waiting on the response, stream it. Users tolerate a long response if they see tokens arriving; they will abandon a spinner after 3 to 5 seconds.
const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  messages: [{ role: "user", content: query }],
});

for await (const chunk of stream) {
  if (chunk.type === "content_block_delta" && chunk.delta.type === "text_delta") {
    process.stdout.write(chunk.delta.text);
  }
}
For non-user-facing backend work (batch jobs, scheduled runs), do not stream. It adds complexity for zero user benefit.
For workloads that do not need real-time responses, the batch API is roughly half the price of real-time. Submit a batch job with up to 100,000 requests, poll for completion, fetch results.
Good candidates: backfills over historical data, bulk classification or enrichment, nightly eval runs. Bad candidates: anything a human is actively waiting on.
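A submit-poll-fetch sketch, assuming the TypeScript SDK's messages.batches surface and a docs array of input texts; check your SDK version for exact method names:

const batch = await client.messages.batches.create({
  requests: docs.map((doc, i) => ({
    custom_id: `doc-${i}`, // your key for matching results back to inputs
    params: {
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      messages: [{ role: "user", content: `Summarize: ${doc}` }],
    },
  })),
});

// Poll until the batch finishes (production code should back off and time out).
let status = await client.messages.batches.retrieve(batch.id);
while (status.processing_status !== "ended") {
  await new Promise((r) => setTimeout(r, 60_000));
  status = await client.messages.batches.retrieve(batch.id);
}

// Results come back as an async iterable, one entry per request.
for await (const entry of await client.messages.batches.results(batch.id)) {
  if (entry.result.type === "succeeded") {
    console.log(entry.custom_id, entry.result.message.content);
  }
}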
Do not try to parse prose. Ask for JSON, specify the schema, validate on receipt, retry on failure.
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: "You are an extractor. Output only valid JSON matching the provided schema.",
  messages: [
    {
      role: "user",
      content: [
        `Extract structured info from the following text.`,
        ``,
        `Schema:`,
        `{`,
        `  "name": string,`,
        `  "email": string,`,
        `  "company": string,`,
        `  "intent": "evaluating" | "buying" | "browsing"`,
        `}`,
        ``,
        `Text: ${inputText}`,
      ].join("\n"),
    },
  ],
});
const parsed = JSON.parse(response.content[0].text);
// Validate against schema (zod, ajv, etc). Retry on failure.
For higher-stakes structured output, use tool use with tool_choice forced to your schema tool. That forces the response into the shape you specified, no string parsing.
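A sketch of that pattern; record_lead is a hypothetical schema tool, and the tool_choice line is what forces Claude to call it:

const extractionTool = {
  name: "record_lead", // hypothetical schema tool
  description: "Record structured info extracted from the text.",
  input_schema: {
    type: "object",
    properties: {
      name: { type: "string" },
      email: { type: "string" },
      company: { type: "string" },
      intent: { type: "string", enum: ["evaluating", "buying", "browsing"] },
    },
    required: ["name", "email", "company", "intent"],
  },
};

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools: [extractionTool],
  tool_choice: { type: "tool", name: "record_lead" }, // force this exact tool
  messages: [{ role: "user", content: `Extract structured info: ${inputText}` }],
});

// The arguments arrive as a parsed object, not a string to parse.
const block = response.content.find((b) => b.type === "tool_use");
const extracted = block?.type === "tool_use" ? block.input : null;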
The SDK retries transient errors (rate limits, 5xx responses, dropped connections) by default. Configure maxRetries and timeout on the client constructor if the defaults do not fit your workload.
Application-level retries handle non-transient failures: if Claude returns empty or malformed JSON, retry with a corrective message (“previous response was not valid JSON; please try again”). Cap at 2 to 3 retries to avoid loops, as in the sketch below.
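A minimal version of that loop, with JSON.parse standing in for real schema validation (swap in zod or ajv):

async function extractWithRetry(inputText: string, maxAttempts = 3) {
  const messages: Array<{ role: "user" | "assistant"; content: string }> = [
    { role: "user", content: `Extract structured info as JSON: ${inputText}` },
  ];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      messages,
    });
    const text =
      response.content[0]?.type === "text" ? response.content[0].text : "";
    try {
      return JSON.parse(text); // replace with schema validation
    } catch {
      // Feed the failure back so the model can correct itself.
      messages.push({ role: "assistant", content: text || "(empty)" });
      messages.push({
        role: "user",
        content: "Previous response was not valid JSON; please try again.",
      });
    }
  }
  throw new Error("Extraction failed after retries");
}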
Every production Claude call should log: the model, input and output token counts (including cache reads and writes), latency, stop_reason, and error status, keyed by a request ID so you can trace failures.
At minimum, a daily dashboard with: total requests, total cost, average latency, P95 latency, error rate, cache hit rate. If any of those moves sharply, something changed and you want to know.
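The per-call numbers come straight off the response object; a sketch, assuming a logger you already have:

const started = Date.now();
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: query }],
});

// usage carries the token accounting, including cache reads and writes.
logger.info({
  model: response.model,
  input_tokens: response.usage.input_tokens,
  output_tokens: response.usage.output_tokens,
  cache_read_tokens: response.usage.cache_read_input_tokens, // null if unused
  cache_write_tokens: response.usage.cache_creation_input_tokens,
  stop_reason: response.stop_reason,
  latency_ms: Date.now() - started,
});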
See the evals primer for the full story. The API-specific point: running evals costs tokens, so budget for them. A nightly eval run over 200 items at 8,000 tokens each is about 1.6 million tokens - real money, but worth it.
For client engagements, enable Zero Retention on your API key. This means the API does not log prompt or response content on Anthropic's servers beyond what is needed to process the request.
For your own logs, decide: whether to store full prompt and response content or metadata only, how long to retain it, who can access it, and whether PII gets redacted before it lands on disk.
This is a security and privacy question, not a debugging question. Get it right before you scale.
We build production AI pipelines on Claude as part of product-development engagements. If you are architecting one and want a second opinion, start a conversation.