Shipping AI features safely
The full framework. Cost modeling is one layer of six.
Most AI cost surprises are not about rate cards. They are about forgetting to multiply. This is the model a founder or PM can actually use before writing a line of prompt.
A SaaS feature has a deploy cost and a maintenance cost. Running it for one more user is effectively free - you already paid for the server. AI features invert this. The build cost is low. The per-unit cost is real, non-zero, and charged every single time the feature fires.
That changes what a cost model has to answer. It is no longer “what does this cost to build?” It is “what does this cost per action, how often will actions happen, and what is the margin after we pay for tokens?”
A feature that is a blockbuster in a demo and a balance-sheet disaster at scale usually has the same origin story: someone built it without doing this math, shipped it on a flat-rate plan, and learned about the cost the month the usage chart went up and to the right.
Every cost model for an AI feature reduces to this:
Cost per action = (input tokens * input rate)
+ (output tokens * output rate)
+ (cache write tokens * cache write rate)
+ (cache read tokens * discounted read rate)
Cost per month = Cost per action * Actions per month
Margin per user = Revenue per user
- (Cost per action * Actions per user per month)
- other COGS
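Translated into code, the whole model is a few lines of Python. Function and variable names here are illustrative, and the cache multipliers are the rough figures discussed later in this piece:

```python
def cost_per_action(input_tokens, output_tokens, input_rate, output_rate,
                    cache_write_tokens=0, cache_read_tokens=0,
                    write_multiplier=1.25, read_multiplier=0.10):
    """Rates are dollars per million tokens. Cache writes are charged at a
    premium over the input rate; cache reads at a deep discount."""
    return (input_tokens * input_rate
            + output_tokens * output_rate
            + cache_write_tokens * input_rate * write_multiplier
            + cache_read_tokens * input_rate * read_multiplier) / 1_000_000

def margin_per_user(revenue, action_cost, actions_per_month, other_cogs=0.0):
    """Monthly margin for one user at a given usage level."""
    return revenue - action_cost * actions_per_month - other_cogs
```

Monthly cost is just `cost_per_action` times actions per month; the margin function makes the per-user version explicit.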
Most of the work in a good cost model is estimating each of those numbers with some honesty. The formula is easy. The inputs are where teams lie to themselves.
One million tokens sounds like a lot. It is not.
A system prompt with guidelines, tool specs, and a handful of few-shot examples is often 4,000 to 12,000 tokens before the user asks anything. Retrieved documents for RAG can be another 5,000 to 30,000 tokens. A chain-of-thought response is typically 500 to 3,000 tokens. One “action” in a real product is rarely under 10,000 total tokens end to end, and often north of 40,000.
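A quick tally with illustrative numbers from the middle of those ranges shows how fast one action adds up:

```python
system_prompt = 8_000    # guidelines, tool specs, few-shot examples
rag_context = 15_000     # retrieved documents
user_message = 500
response = 1_500         # chain-of-thought output

total_tokens = system_prompt + rag_context + user_message + response
print(total_tokens)      # 25000 tokens for a single action
```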
The correct way to size a feature is:
1. Write the real system prompt, tool specs, and retrieval pipeline, not a toy version.
2. Run it against a representative sample of inputs.
3. Log the actual input and output token counts the API reports for each call.
If you have not done step 3 yet, your cost model is a guess. Fix that before arguing about rate cards.
Use the rate card your provider publishes. Claude, ChatGPT, and open-weight hosted models all quote per-million-token prices that are easy to plug into the formula above.
What matters more than the exact number is sensitivity. Build your model so you can change one input (say, switch from a frontier model to a smaller one, or cut system prompt by 40 percent) and immediately see the effect on cost per action, monthly cost, and margin. Without that sensitivity, you are guessing about the wrong thing.
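A minimal version of that sensitivity model, with an invented rate card (these are not any provider's actual prices):

```python
# $/M tokens -- illustrative rate card, not any provider's actual prices
RATES = {"frontier": (3.00, 15.00), "small": (0.25, 1.25)}

def cost(in_tok, out_tok, model):
    in_rate, out_rate = RATES[model]
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

base = cost(25_000, 1_500, "frontier")   # baseline: $0.0975 per action
trim = cost(15_000, 1_500, "frontier")   # shorter prompt: $0.0675
swap = cost(25_000, 1_500, "small")      # smaller model: $0.008125
```

One-line changes to a single input immediately reprice the action, which is exactly the property the model needs.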
Two rules that hold across providers: output tokens cost several times more than input tokens (a 5x spread, like the $3 and $15 rates used later in this piece, is typical), so verbose responses move the bill faster than long prompts. And the price gap between a provider's frontier model and its smaller tier is usually an order of magnitude, so model choice matters far more than provider choice.
Prompt caching is the single biggest lever on AI cost. It is also the one teams most frequently overclaim.
The mechanic: the first call writes a cache entry at a premium (roughly 1.25x the input rate on Claude). Subsequent calls within the TTL read from cache at a discount (roughly 0.1x the input rate). At those multipliers the very first cache hit already pays back the write premium: 1.25 + 0.1 = 1.35 input-rate units across two calls, versus 2.0 uncached.
That math is beautiful if your cache actually gets hit. It does not always.
Cache keys in most providers are prefix-based. A conversation that prepends or edits even one token at the start of the prompt invalidates the entire cache. In practice that means:
- Put stable content first: system prompt, tool specs, few-shot examples.
- Put anything that varies per call (user name, retrieved documents, conversation turns) after the stable prefix.
- Never inject a dynamic value such as the current timestamp at the top of the system prompt; that alone can drive the hit rate to zero.
Model both the hit rate and the miss cost. A 60 percent cache hit rate on a 20,000-token prefix saves real money. A 5 percent hit rate with a 1.25x write premium on every miss actively costs more than no cache at all.
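Both claims in that paragraph fall out of one expected-value formula, sketched here with the rough 1.25x/0.1x multipliers from above:

```python
def effective_input_multiplier(hit_rate, write_mult=1.25, read_mult=0.10):
    """Expected input-cost multiplier relative to no caching (1.0):
    misses pay the write premium, hits pay the discounted read rate."""
    return (1 - hit_rate) * write_mult + hit_rate * read_mult

print(effective_input_multiplier(0.60))  # 0.56   -> 44% cheaper than no cache
print(effective_input_multiplier(0.05))  # 1.1925 -> ~19% MORE than no cache
# Break-even: (1 - h) * 1.25 + h * 0.10 = 1.0  ->  h ~ 22% hit rate
```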
Most providers offer a batch endpoint at roughly half the real-time price. It is one of the fastest ways to cut cost without changing the feature.
The tradeoff is latency: batch jobs may take minutes to hours to complete. For synchronous user-facing features (chat, inline suggestions, anything the user is watching a spinner for) batch is not an option. For everything else, it usually is.
A rough classification:
- Real-time only: chat, inline suggestions, anything the user is watching a spinner for.
- Batchable: digests, scheduled reports, document enrichment, classification backfills, anything delivered by notification or on the next page load.
Many features that teams assume must be real-time become batchable the moment a PM is willing to say “results available within 10 minutes” instead of “instant.” That phrasing change often halves the monthly bill.
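The arithmetic behind that claim, with an assumed per-action cost and volume:

```python
cost_per_action = 0.045     # $/action from the cost formula (illustrative)
actions_per_month = 200_000
batch_discount = 0.50       # typical batch-endpoint pricing

realtime_bill = cost_per_action * actions_per_month                   # $9,000
batch_bill = cost_per_action * batch_discount * actions_per_month     # $4,500
```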
A cost-per-user number in isolation is not useful. What matters is the curve: how does cost per user change as your power users use the feature 10x more than your median user?
Build three tiers into your model:
- Median user: the typical case, where most of the revenue sits.
- 90th percentile: the power user who hits the feature every day.
- 99th percentile: the outlier running 10x to 100x the median volume.
A pricing plan that is profitable on the median user and ruinous on the 99th percentile is not a pricing plan - it is an invitation for someone on Hacker News to post an "I got $8,000 of AI compute on the $29 plan" thread. Model the outlier. Decide in advance whether you cap, rate-limit, meter, or upsell them. Then price accordingly.
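Here is what the three tiers look like in code, with a hypothetical plan price and usage distribution (all numbers assumed for illustration):

```python
plan_price = 29.0           # $/user/month -- hypothetical
cost_per_action = 0.05      # from the cost formula
usage_tiers = {"median": 40, "p90": 400, "p99": 4_000}  # actions/month, assumed

margins = {tier: plan_price - cost_per_action * n for tier, n in usage_tiers.items()}
print(margins)  # {'median': 27.0, 'p90': 9.0, 'p99': -171.0}
```

Profitable at the median and the 90th percentile, badly underwater at the 99th: this is the shape that demands a cap or an upsell before launch.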
After the model is built, four questions decide whether the feature is worth shipping at all:
If margin is negative at 90th-percentile usage, the feature is not shippable at current model costs. Options: cheaper model, shorter prompts, more aggressive caching, batch the workload, reprice the plan, or do not build it. “Ship and hope” is not one of the options.
Hypothetical product. SaaS tool, $49/user/month. Adding a “summarize this document” button.
Inputs measured from prototype:
At a mid-tier model ($3 per million input, $15 per million output), no caching:
With system prompt cached (60 percent hit rate):
Risk: if the feature gets scripted (someone loops it against 10,000 documents) the outlier model breaks. Mitigation: hard rate-limit at 1,000 summaries per user per month, with a “contact us for high-volume” upsell path. Now the cost envelope is bounded.
Verdict: ship it. Ship it with the rate limit, not without.
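Since the measured inputs are not reproduced above, here is the same calculation with plausible stand-in numbers at the stated $3/$15 rates; every token count below is an assumption, not a measurement:

```python
IN_RATE, OUT_RATE = 3.00, 15.00   # $/M tokens, the mid-tier rates stated above
system_prompt, document, summary = 6_000, 12_000, 800  # tokens -- assumed

uncached = ((system_prompt + document) * IN_RATE + summary * OUT_RATE) / 1e6

# System prompt cached at a 60% hit rate: misses write at 1.25x, hits read at 0.1x.
prompt_mult = 0.40 * 1.25 + 0.60 * 0.10   # ~0.56
cached = ((system_prompt * prompt_mult + document) * IN_RATE
          + summary * OUT_RATE) / 1e6
# Roughly $0.066 vs $0.058 per summary -- pennies per action against a $49 plan,
# which is why the verdict hinges on the rate limit, not the unit cost.
```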
We maintain a cost model template (linked from samples when shipped) with the following columns: model name, input rate, output rate, cache write rate, cache read discount, measured input tokens, measured output tokens, cache hit rate, actions per median/P90/P99 user, revenue per user, resulting margin at each percentile.
The spreadsheet is less important than the discipline. A model built in any tool that forces you to fill in all those numbers is better than a beautifully formatted one that hides the assumptions.
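The listed columns map naturally onto a typed row; a sketch, assuming nothing beyond those columns (the `margin` helper simplifies by treating the whole measured input as cacheable):

```python
from dataclasses import dataclass

@dataclass
class CostModelRow:
    model_name: str
    input_rate: float            # $/M tokens
    output_rate: float
    cache_write_rate: float      # premium rate paid on cache misses
    cache_read_rate: float       # discounted rate paid on cache hits
    measured_input_tokens: int
    measured_output_tokens: int
    cache_hit_rate: float
    actions_median: int
    actions_p90: int
    actions_p99: int
    revenue_per_user: float

    def margin(self, actions: int) -> float:
        """Margin at a given usage tier, ignoring other COGS.
        Simplification: treats all input tokens as cacheable."""
        hit = self.cache_hit_rate
        input_cost = self.measured_input_tokens * (
            (1 - hit) * self.cache_write_rate + hit * self.cache_read_rate)
        output_cost = self.measured_output_tokens * self.output_rate
        per_action = (input_cost + output_cost) / 1_000_000
        return self.revenue_per_user - per_action * actions
```

A row per candidate model makes the margin-at-each-percentile comparison a three-line loop.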
We run cost-modeling exercises as part of most product-development engagements. If you are trying to decide whether a feature survives contact with a pricing plan, start a conversation. We will walk through the model with you and surface the assumptions that matter.