E5 Essay · Applied AI

Prompt caching is unit economics.

Teams treat prompt caching like a performance optimization. It is not. It is the dividing line between a feature that can scale to real usage and a feature that blows up your margin the week you hit product-market fit.

Applied AI · 14 min read · 2026-04-15 · by the operator, drafting assisted by Claude
Corrections log: none yet. If you find a factual error, email hello@nexcur.ai and we will log it here, dated.
1 The mistake teams keep making

A team ships an AI feature. The prototype runs fine. Usage grows. Bills grow faster than usage. Someone proposes adding prompt caching as a "performance optimization" for the next sprint.

This is how I know a team has not modeled the unit economics. Caching was not a nice-to-have that slipped down the backlog. It was a structural decision about where the cache boundary sits in the prompt. When that boundary is in the wrong place, no amount of bolt-on caching recovers the margin you have already lost.

Every Claude-powered feature we have ever built starts with one architectural question: what is the thing that stays the same across calls, and what is the thing that changes per call? The answer to that question determines whether the feature can scale. Cache design is not an optimization. It is the schema of the feature.

2 What caching actually does

A cached block of tokens is billed at 0.1x input price on read and 1.25x input price on write. That is the entire API you need to reason about.

The cache is an ephemeral (by default, 5-minute TTL) store of tokens that the model has already processed. When you mark a block as cacheable, the first call pays a write premium. Every subsequent call that hits the same cache pays a read discount. The TTL can be extended to 1 hour at a 2x write multiplier if your hit rate justifies it.

The break-even is roughly two calls. One write at 1.25x plus one read at 0.1x comes to 1.35x; two uncached calls come to 2.0x. Past two, caching wins hard. At ten calls against the same cache, you are paying about 22% of the uncached equivalent for the cached portion. At fifty, about 12%.
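Here is that arithmetic as a short sketch - the 1.25x write and 0.1x read multipliers are the ones above; everything else is just the averaging:

```python
# Effective per-call price multiplier for the cached block, relative to the
# uncached input price: one write at 1.25x, then n - 1 reads at 0.1x.
def effective_multiplier(n_calls: int, write: float = 1.25, read: float = 0.10) -> float:
    return (write + (n_calls - 1) * read) / n_calls

for n in (1, 2, 10, 50):
    print(f"{n:>2} calls -> {effective_multiplier(n):.3f}x")
# 1 -> 1.250x, 2 -> 0.675x, 10 -> 0.215x (~22%), 50 -> 0.123x (~12%)
```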

This is not a performance story. It is a cost story. Latency improvements are real (the model skips re-tokenizing and re-attending over the cached block) but they are downstream of the economic calculus. The feature either has a cache hit rate that makes it viable at scale, or it does not.

3 The break-even math

Consider a support bot with a 22,000-token system prompt (product docs, policies, tone guide).

Uncached, each call costs roughly $0.066 on Sonnet 4.6 for the system prompt alone. That is before the user message or the response. A thousand calls per day is $66. Ten thousand is $660. A hundred thousand is $6,600 a day, which is roughly $198k a month, which is a real team's cost line.

Cached, the first call in a 5-minute window pays $0.0825 for the write. Every subsequent call in that window pays $0.0066 for the read. If the average 5-minute window sees five calls (a modest assumption for a support bot at any real volume), the effective per-call cost is roughly $0.0218 - or 33% of uncached. At fifty calls per 5-minute window, it drops to about 12%.

The monthly cost at a hundred thousand calls per day moves from roughly $198k uncached to roughly $24k cached (at the fifty-calls-per-window assumption). That is not an optimization. That is the difference between "ship it" and "cannot ship it". Same code, same model, same quality. Different cache design.
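The same numbers as a sketch you can adapt. The $3-per-million-token input rate is inferred from the $0.066 figure above; substitute your model's actual prices and your own traffic pattern:

```python
# Support-bot scenario from above: 22k-token system prompt, $3 / MTok input,
# so $3.75 / MTok on cache write and $0.30 / MTok on cache read.
TOKENS = 22_000
INPUT = 3.00                        # $ per million tokens, uncached
WRITE, READ = INPUT * 1.25, INPUT * 0.10

def per_call(calls_per_window: int) -> float:
    """Average cost of the cached block per call: one write, then reads."""
    write = TOKENS / 1e6 * WRITE
    read = TOKENS / 1e6 * READ
    return (write + (calls_per_window - 1) * read) / calls_per_window

uncached = TOKENS / 1e6 * INPUT                     # $0.066 per call
print(f"5 calls/window:  ${per_call(5):.4f}")       # ~$0.0218, 33% of uncached
print(f"50 calls/window: ${per_call(50):.4f}")      # ~$0.0081, ~12% of uncached
print(f"monthly at 100k calls/day: ${100_000 * 30 * per_call(50):,.0f}")  # ~$24k vs ~$198k
```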

4 Where caching fits structurally

Not every prompt benefits from caching. Knowing which is which is the entire skill.

A Claude API call is structured as a sequence of blocks: system prompt, tool definitions, few-shot examples, retrieved context, conversation history, user message. Cache markers can be placed at block boundaries. What the model caches is everything up to and including the marker. Everything after is fresh per call.

This means the cache boundary needs to sit at the transition from "same across calls" to "different per call". If your system prompt contains a user-specific piece of data (a customer name, a session ID), you have polluted the cacheable block and defeated the purpose. The boundary is often fine on a napkin and ruined in code three sprints later when someone adds personalization.
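Here is what the boundary looks like in an actual call, as a minimal sketch using the Python SDK. The model id and the prompt variables are placeholders; the cache_control marker on the last stable block is the line we are talking about.

```python
from anthropic import Anthropic

client = Anthropic()

POLICY = open("policy.md").read()    # role, constraints, schema, few-shot: stable
retrieved_context = "..."            # per-query retrieval: variable
user_query = "..."                   # per-call input: variable

response = client.messages.create(
    model="claude-sonnet-4-5",       # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": POLICY,
            # Cache boundary: everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        # Everything below the boundary is fresh per call.
        {"role": "user", "content": f"{retrieved_context}\n\n{user_query}"},
    ],
)
```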

Three architectural patterns tend to work:

  • Fixed policy + retrieval + query. System prompt is the policy (cached). Retrieved documents are the variable (uncached). User query is the variable (uncached). This is RAG. The cache boundary sits between policy and retrieval.
  • Big document + per-question queries. Document is cached once. Every question against it is a read. This is the "chat with PDF" pattern. Works beautifully when a single session produces many questions.
  • Stable tool schemas + per-call arguments. Tools are defined once (cached). Arguments are per-call (uncached). Agent loops benefit enormously from this boundary.

And three patterns fight caching hard:

  • Conversations where every turn injects fresh context. Every new turn pushes the old context further into the "stable" portion, which would be great - except most teams accidentally rebuild the whole prompt each turn and invalidate the cache (a sketch of the append-only alternative follows this list).
  • Per-user personalization inside the system prompt. Moves the cache boundary forward past any personalized content. Usually means no effective caching.
  • Random-order few-shot selection. Teams that select a different 5-shot example set per call to "improve diversity" are also selecting a cache miss per call.
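
A minimal sketch of the append-only shape that avoids the first of these anti-patterns, assuming the stable block carries the cache marker as in the example above:

```python
# Keep one append-only history per conversation. Earlier turns are never
# rewritten, reordered, or summarized in place, so the token prefix the cache
# was written against stays identical on the next call.
history = []

def next_turn(client, system_blocks, user_text):
    history.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder model id
        max_tokens=1024,
        system=system_blocks,        # stable block with cache_control, as above
        messages=history,
    )
    history.append({"role": "assistant", "content": response.content})
    return response
```
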
5 A case study in three acts

A team we worked with in late 2025 had shipped an AI feature that interviewed users, synthesized their answers, and wrote a personalized "discovery report".

Act one: the prototype. The prototype cost $4.80 per report. The founder was enthusiastic about the quality. We asked what the target revenue per report was. They said $15. We asked about the P90 report cost, not the average. They did not know. They pulled the data that afternoon: P90 was $11.40. Margin at P90 was 24%. Acceptable but thin.

Act two: growth breaks the model. Six weeks later, after a launch, the feature was producing reports at $6.20 on average and $14.10 at P90. P90 margin had evaporated. The product owner was proposing raising the price to $29 to recover margin. We asked to see the prompt structure before making any pricing change.

The prompt was 28,000 tokens. Twenty-six thousand of those tokens were stable across every call: the interview-framework primer, the writing guide, the structural template, the example reports. Two thousand were genuinely per-user: the interview answers. The team had not marked the stable block as cacheable. Every call paid full input price for the full 28k tokens.

Act three: the cache boundary. We added a cache marker after the stable block. The typical user produced 3 to 5 reports in a session. Within a session, the first report paid a write premium; the next 2 to 4 hit the cache at 0.1x. Across sessions, the 5-minute TTL let most back-to-back users hit cache from recent activity. P90 cost dropped from $14.10 to $3.80 per report. Margin restored without a price change.

The product did not change. The quality did not change. The difference is a six-line diff that adds a cache marker. The lesson is not "caching is good". The lesson is that the cache boundary is a design decision that should be made before the first customer sees the feature, not after the bill becomes a problem.

6 Common anti-patterns

Things we see repeatedly when we audit a team's prompt economics.

Sorting inputs differently per call. A team we audited was pulling retrieved documents in rank order per query. Each query produced a different order. Even if the documents themselves were the same, the cache was invalidated because the token sequence differed. Fix: sort retrieved documents by a stable key (doc ID) before they enter the prompt.
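A minimal sketch of that fix - this assumes the retrieved block sits inside the cacheable prefix and that every document carries a stable id:

```python
def stable_context(docs: list[dict]) -> str:
    """Join retrieved documents in doc-id order, not rank order, so identical
    result sets always produce an identical token sequence in the prompt."""
    return "\n\n".join(d["text"] for d in sorted(docs, key=lambda d: d["doc_id"]))
```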

Timestamping the system prompt. Another team included the current date in the system prompt "so the model knows what year it is". The date changed every day (every second, in one version). The cache died every midnight. Fix: put the date outside the cache boundary, in the per-call section.

Per-user name in the preamble. "You are helping {{user_name}}" destroys cacheability per user. If personalization must happen, move it after the cache boundary. The system prompt should describe the role, not the audience.

Random few-shot rotation. Rotating examples to "prevent overfitting" is a cache-hostile practice that also does not reliably improve output quality. Pick a stable set. If you want diversity, do it with user-side sampling, not in-prompt rotation.

Not measuring hit rate. Most teams we audit do not know their cache hit rate. The API returns the stats per call - log them. If your hit rate is below 50%, you either have too short a TTL for your traffic or a structural invalidation problem.
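A minimal logging sketch - the usage field names are the ones the Messages API returns when caching is active; the helper name is ours:

```python
import logging

log = logging.getLogger("prompt_cache")

def log_cache_usage(response) -> float:
    """Log per-call cache stats and return the share of prompt tokens read
    from cache. Missing fields are treated as zero."""
    u = response.usage
    read = getattr(u, "cache_read_input_tokens", 0) or 0
    write = getattr(u, "cache_creation_input_tokens", 0) or 0
    fresh = u.input_tokens
    total = read + write + fresh
    hit_rate = read / total if total else 0.0
    log.info("cache read=%d write=%d fresh=%d hit_rate=%.1f%%",
             read, write, fresh, hit_rate * 100)
    return hit_rate
```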

Treating caching as "later". By the time caching makes the backlog, the cost model is already bleeding. Design the cache boundary in the first draft of the prompt, not the fifth.

7 Design the cache boundary first

If you take one thing from this essay: when you are writing the first version of a prompt for a feature that will see real traffic, draw a line.

Everything above the line is stable. Everything below is per-call. The stable portion should include the role, the task description, the output schema, the constraints, the tool definitions, the few-shot examples, and any reference material that does not change per user. The per-call portion should include the user-specific input, the retrieved context (if unique), and the recent conversation turn.

Draw that line on paper before you write the prompt. Then put a cache marker at the line. Then measure the hit rate. Then adjust the TTL or the structure if the rate is low. That is the entire discipline.

The teams that scale AI features profitably are not smarter than the teams that do not. They made one decision earlier: they decided where the cache boundary sat before they wrote the first call. Everything downstream of that decision - the margin, the ability to grow, the resilience to traffic spikes - follows from it.

Prompt caching is not a performance optimization. It is the schema of your feature. Design it that way.

E5.X Related work

  • Sample - Cost model template. Copy-paste spreadsheet with the caching break-even math worked through three scenarios.
  • Guide - Cost modeling for AI features. The full cost discipline: caching plus model selection plus batch economics.
  • Service - Product development. We audit prompt structure and cache design as part of every product engagement.