G8.2 Guide · Operations

Working with the Claude API.

For engineers who are past the “hello world” tutorial and ready to ship. API fundamentals, caching, tool use, batch processing, observability, evals, and deployment - the pieces you actually need to build a reliable production pipeline.

Length: 45 min · Audience: Eng lead / platform eng / CTO · Last updated: 2026-04-19

Start with the SDK, not the raw HTTP

Unless you have a specific reason (an unsupported language, an exotic deployment), use Anthropic's official SDK. TypeScript and Python both have strong support. The SDK buys you retries, streaming, type safety, and version stability for free.

// TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  system: "You are a senior security researcher.",
  messages: [
    { role: "user", content: "Summarize OWASP LLM Top 10 in 5 bullets." },
  ],
});

That is the minimum viable call. Everything else builds on this shape.

Model selection

Claude offers a family of models with different capability and cost profiles. Rough mental model:

  • Opus (flagship). Highest capability. Reserve for tasks needing deep reasoning, long context synthesis, or hard-to-get-right outputs.
  • Sonnet. The workhorse. Strong for most production tasks. Usually the right default.
  • Haiku. Fast and cheap. Perfect for classification, routing, extraction, short summaries, latency-sensitive flows.

A common architecture is a router: Haiku classifies the incoming request, then routes to Sonnet for most work and Opus for specific hard subtasks. This often cuts cost by 60 to 80 percent vs sending everything to Opus.
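A minimal sketch of that router, reusing the client from above. The Haiku model ID and the three-tier taxonomy are illustrative assumptions, not canon:

// Haiku triages the request into a tier; the tier picks the model.
// The tier names and the Haiku model ID are assumptions for illustration.
async function route(userQuery: string): Promise<string> {
  const triage = await client.messages.create({
    model: "claude-haiku-4-5", // assumed Haiku-class model ID
    max_tokens: 5,
    system: "Classify the request difficulty. Reply with exactly one word: easy, normal, or hard.",
    messages: [{ role: "user", content: userQuery }],
  });
  const tier = triage.content[0].type === "text" ? triage.content[0].text.trim() : "normal";

  const model =
    tier === "hard" ? "claude-opus-4-7" :
    tier === "easy" ? "claude-haiku-4-5" :
    "claude-sonnet-4-6";

  const response = await client.messages.create({
    model,
    max_tokens: 1024,
    messages: [{ role: "user", content: userQuery }],
  });
  return response.content[0].type === "text" ? response.content[0].text : "";
}

The triage call is cheap by construction: a five-token cap and a one-word answer.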

Prompt caching: the biggest lever

Prompt caching lets you reuse static parts of a prompt across calls. System prompts, tool definitions, long context, and few-shot examples are ideal candidates.

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longStaticSystemPrompt,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    { role: "user", content: userQuery },
  ],
});

Behavior:

  • First call writes the cache at a premium (roughly 1.25x input rate).
  • Subsequent calls within the TTL read from cache at a discount (roughly 0.1x input rate).
  • Cache is prefix-based. Changing any token before a cached chunk invalidates the cache.

Put static content as early in the prompt as possible. Put dynamic content (user messages, retrieved documents) after the cached content. The cost model is in the cost-modeling guide.
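You can confirm the cache is doing its job from the usage block on each response, which reports cache writes and reads separately:

// On a cache write, cache_creation_input_tokens is non-zero; on a hit,
// cache_read_input_tokens is. Log both to track your hit rate over time.
const { usage } = response;
console.log({
  input: usage.input_tokens,
  output: usage.output_tokens,
  cacheWrite: usage.cache_creation_input_tokens,
  cacheRead: usage.cache_read_input_tokens,
});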

Tool use

Claude can call tools (functions) you define. This is how you give it the ability to read databases, hit APIs, write files, or trigger actions.

const tools = [
  {
    name: "get_customer",
    description: "Look up a customer by ID. Returns name, email, plan.",
    input_schema: {
      type: "object",
      properties: {
        customer_id: { type: "string" },
      },
      required: ["customer_id"],
    },
  },
];

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools,
  messages: [
    { role: "user", content: "What plan is customer C-4821 on?" },
  ],
});

// If response.stop_reason === "tool_use":
//   1. Extract the tool call from response.content
//   2. Execute your tool
//   3. Send the result back in the next turn
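Spelled out, one iteration of that loop looks roughly like this. getCustomer is your own implementation (assumed here); the shape of the tool_result message is the part that matters:

if (response.stop_reason === "tool_use") {
  const toolCall = response.content.find((block) => block.type === "tool_use");
  if (toolCall?.type === "tool_use") {
    // Execute your own implementation of the tool (getCustomer is assumed).
    const result = await getCustomer(toolCall.input as { customer_id: string });

    // Return the result, echoing tool_use_id so Claude can pair it with the call.
    const followUp = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      tools,
      messages: [
        { role: "user", content: "What plan is customer C-4821 on?" },
        { role: "assistant", content: response.content },
        {
          role: "user",
          content: [
            {
              type: "tool_result",
              tool_use_id: toolCall.id,
              content: JSON.stringify(result),
            },
          ],
        },
      ],
    });
    // followUp now contains the natural-language answer.
  }
}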

Tool use is powerful and easy to misuse. Best practices:

  • One clear job per tool. Do not ship a do_thing tool with 12 optional parameters.
  • Strong schemas. Use required fields, enums, and descriptions. Claude follows schemas well but only if they are clear.
  • Idempotent actions. If Claude retries, the second call should not double-charge, double-email, or double-delete.
  • Audit log. Log every tool call, inputs, outputs, and latency. When a user asks “why did the agent do X,” you want the log.
  • Approval gates for risky tools. For anything irreversible, require a human approval step. Do not let the agent push to prod by itself.

Streaming

Anytime a human is waiting on the response, stream it. Users tolerate a long response if they see tokens arriving; they will abandon a spinner after 3 to 5 seconds.

const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  messages: [{ role: "user", content: query }],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}

For non-user-facing backend work (batch jobs, scheduled runs), do not stream. It adds complexity for zero user benefit.

Batch processing

For workloads that do not need real-time responses, the batch API is roughly half the price of real-time. Submit a batch job with up to 100,000 requests, poll for completion, fetch results; a sketch follows the candidate lists below.

Good candidates:

  • Bulk content moderation.
  • Backfilling an existing dataset with AI-generated enrichments.
  • Nightly summaries or reports delivered by email.
  • Running the eval harness against a golden dataset.

Bad candidates: anything a human is actively waiting on.
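A minimal submit/poll/fetch sketch, assuming the SDK's Message Batches surface (client.messages.batches) and an items array of prompts; the model choice and polling interval are illustrative:

// Submit a batch: each request carries a custom_id so results can be matched back.
const batch = await client.messages.batches.create({
  requests: items.map((item, i) => ({
    custom_id: `item-${i}`,
    params: {
      model: "claude-haiku-4-5", // illustrative model choice
      max_tokens: 512,
      messages: [{ role: "user", content: item }],
    },
  })),
});

// Poll until processing ends, then stream results back.
let status = batch;
while (status.processing_status !== "ended") {
  await new Promise((r) => setTimeout(r, 60_000));
  status = await client.messages.batches.retrieve(batch.id);
}
for await (const entry of await client.messages.batches.results(batch.id)) {
  if (entry.result.type === "succeeded") {
    console.log(entry.custom_id, entry.result.message.content);
  }
}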

Structured output

Do not try to parse prose. Ask for JSON, specify the schema, validate on receipt, retry on failure.

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  system: "You are an extractor. Output only valid JSON matching the provided schema.",
  messages: [
    {
      role: "user",
      content: [
        `Extract structured info from the following text.`,
        ``,
        `Schema:`,
        `{`,
        `  "name": string,`,
        `  "email": string,`,
        `  "company": string,`,
        `  "intent": "evaluating" | "buying" | "browsing"`,
        `}`,
        ``,
        `Text: ${inputText}`,
      ].join("\n"),
    },
  ],
});

const textBlock = response.content.find((b) => b.type === "text");
const parsed = JSON.parse(textBlock?.type === "text" ? textBlock.text : "");
// Validate against a schema (zod, ajv, etc.). Retry on failure.

For higher-stakes structured output, use tool use with tool_choice forced to your schema tool. That forces the response into the shape you specified, no string parsing.
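A sketch of that pattern: define a tool whose input schema is your output schema, then force the model to call it. The record_lead tool name is illustrative:

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools: [
    {
      name: "record_lead",
      description: "Record the structured lead extracted from the text.",
      input_schema: {
        type: "object",
        properties: {
          name: { type: "string" },
          email: { type: "string" },
          company: { type: "string" },
          intent: { type: "string", enum: ["evaluating", "buying", "browsing"] },
        },
        required: ["name", "email", "company", "intent"],
      },
    },
  ],
  tool_choice: { type: "tool", name: "record_lead" },
  messages: [{ role: "user", content: `Extract structured info: ${inputText}` }],
});

// The "call" arrives as an already-parsed object; no string parsing needed.
const block = response.content.find((b) => b.type === "tool_use");
const lead = block?.type === "tool_use" ? block.input : null;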

Retries and rate limits

The SDK retries transient errors by default. Configure these (a sketch follows the list):

  • maxRetries - default 2. Raise to 4 or 5 for production.
  • timeout - default 10 minutes. Set lower for user-facing flows.
  • Rate limit handling - back off and retry on 429. The SDK already does this; do not reimplement it.
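Both knobs live on the client constructor; the values here are illustrative, and timeout is in milliseconds:

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  maxRetries: 4,   // SDK default is 2
  timeout: 60_000, // 60 seconds; SDK default is 10 minutes
});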

Application-level retries handle non-transient errors: if Claude returns an empty or malformed JSON response, retry with a corrective message (“the previous response was not valid JSON; please try again”). Cap at 2 to 3 retries to avoid loops.
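A sketch of that corrective loop; validate stands in for your own schema check (zod, ajv, whatever you use):

async function extractWithRetry(prompt: string, maxAttempts = 3) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: prompt }];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 2048,
      messages,
    });
    const text = response.content[0].type === "text" ? response.content[0].text : "";
    try {
      const parsed = JSON.parse(text);
      if (validate(parsed)) return parsed; // validate(): your schema check (assumed)
    } catch {
      // fall through to the corrective turn below
    }
    messages.push(
      { role: "assistant", content: text || "(empty)" },
      { role: "user", content: "The previous response was not valid JSON matching the schema. Output only corrected JSON." },
    );
  }
  throw new Error(`No valid JSON after ${maxAttempts} attempts`);
}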

Observability

Every production Claude call should log:

  • Request ID (correlate across logs).
  • Model used.
  • Input token count.
  • Output token count.
  • Cache hit / miss indicator.
  • Latency.
  • Cost (computed from tokens).
  • Error, if any.

At minimum, a daily dashboard with: total requests, total cost, average latency, P95 latency, error rate, cache hit rate. If any of those moves sharply, something changed and you want to know.
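A sketch of a wrapper that captures those fields on every call. The price table is a stand-in for your own config, and console.log is a stand-in for your log sink:

// One structured log line per request. Prices are illustrative placeholders;
// load real per-model rates from your own configuration.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  "claude-sonnet-4-6": { input: 3, output: 15 },
};

async function loggedCreate(params: Anthropic.MessageCreateParamsNonStreaming) {
  const start = Date.now();
  const response = await client.messages.create(params);
  const u = response.usage;
  const price = PRICE_PER_MTOK[params.model] ?? { input: 0, output: 0 };
  console.log(JSON.stringify({
    requestId: response.id,
    model: params.model,
    inputTokens: u.input_tokens,
    outputTokens: u.output_tokens,
    cacheRead: u.cache_read_input_tokens,
    cacheWrite: u.cache_creation_input_tokens,
    latencyMs: Date.now() - start,
    costUsd: (u.input_tokens * price.input + u.output_tokens * price.output) / 1e6,
  }));
  return response;
}

Wrap the call in try/catch to log the error path with the same fields.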

Evals in CI

See the evals primer for the full story. For API integrations specifically:

  • A golden dataset of 30 to 200 representative inputs.
  • Expected outputs or scoring criteria per input.
  • A runner that executes your pipeline against every input and records the output.
  • A scorer that produces a quality number per run.
  • A regression gate in CI that blocks merging if the score drops more than a threshold.

Running evals costs tokens. Budget for them. A nightly eval run over 200 items at 8,000 tokens each is about 1.6 million tokens - real money, but worth it.
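The regression gate itself can be very small. A sketch, assuming a golden.json of {input, expected} pairs plus your own runPipeline and score functions:

import { readFileSync } from "node:fs";

// golden.json, runPipeline(), and score() are assumptions: your dataset,
// your pipeline under test, and your scorer returning a number in [0, 1].
const golden: { input: string; expected: string }[] =
  JSON.parse(readFileSync("golden.json", "utf8"));

let total = 0;
for (const item of golden) {
  const output = await runPipeline(item.input);
  total += score(output, item.expected);
}
const avg = total / golden.length;

const BASELINE = 0.85; // commit this next to the dataset; illustrative value
if (avg < BASELINE - 0.03) {
  console.error(`Eval score ${avg.toFixed(3)} is below the gate; blocking merge.`);
  process.exit(1);
}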

Zero Retention and data handling

For client engagements, enable Zero Retention on your API key. This means the API does not log prompt or response content on Anthropic's servers beyond what is needed to process the request.

For your own logs, decide:

  • What do you log vs hash vs drop?
  • How long do you retain?
  • Where do logs go - a provider that has its own data-handling commitments, or internal storage?
  • How do you honor a customer deletion request?

This is a security and privacy question, not a debugging question. Get it right before you scale.

Deployment considerations

  • Secrets in env, not in code. Use a secret manager in production. Rotate API keys on a schedule.
  • Isolate API keys by environment. Dev, staging, and prod use different keys. A leaked dev key should not be a production incident.
  • Rate limit your own users. Do not let an infinite loop somewhere drain your API budget in a morning. Cap per-user requests and total daily spend.
  • Fallback behavior. What happens when Claude is slow or down? Timeout, show a graceful message, optionally fall back to a cheaper model or a pre-computed cached response (a sketch follows this list).
  • Region and latency. If you have users globally, consider how latency affects the experience. Cache aggressively on the edge where possible.
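For the fallback bullet, a minimal sketch using the SDK's per-request timeout override; the 15-second deadline and the fallback copy are illustrative:

// Give the model call a hard deadline; degrade gracefully instead of spinning.
async function answerWithFallback(query: string): Promise<string> {
  try {
    const response = await client.messages.create(
      {
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        messages: [{ role: "user", content: query }],
      },
      { timeout: 15_000 }, // per-request override for a user-facing flow
    );
    return response.content[0].type === "text" ? response.content[0].text : "";
  } catch {
    // Model slow or down: show a graceful message rather than an error page.
    return "We're having trouble generating an answer right now. Please try again shortly.";
  }
}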

Anti-patterns

  • Calling Claude from the client. Exposes your API key. Always proxy through your backend.
  • No max_tokens. The model will happily produce 8,000 tokens when you needed 400. Cap it.
  • Optimistic JSON parsing without validation. Wrap in try/catch, validate against schema, retry on failure.
  • Per-call tool definitions. Tool schemas are usually static. Mark them with cache_control so they are served from the prompt cache instead of being re-sent at full input price.
  • One giant prompt. Break complex tasks into smaller tool-using steps. Easier to debug, cheaper, more reliable.
  • No model version pinning in contracts. Pin the model ID in your config. Upgrading is a deliberate decision with evals, not a surprise because the alias moved.

When to talk to us

We build production AI pipelines on Claude as part of product-development engagements. If you are architecting one and want a second opinion, start a conversation.