G5.3 Guide · Product development


Cost modeling for AI features.

Most AI cost surprises are not about rate cards. They are about forgetting to multiply. This is the model a founder or PM can actually use before writing a line of prompt.

Length: 24 min · Audience: Finance / CFO / PM / founder · Last updated: 2026-04-19

Why cost models for AI are different

A SaaS feature has a deploy cost and a maintenance cost. Running it for one more user is effectively free - you already paid for the server. AI features invert this. The build cost is low. The per-unit cost is real, non-zero, and charged every single time the feature fires.

That changes what a cost model has to answer. It is no longer “what does this cost to build?” It is “what does this cost per action, how often will actions happen, and what is the margin after we pay for tokens?”

A feature that is a blockbuster in a demo and a balance-sheet disaster at scale usually has the same origin story: someone built it without doing this math, shipped it on a flat-rate plan, and learned about the cost the month the usage chart went up and to the right.

The core formula

Every cost model for an AI feature reduces to this:

Cost per action  =  (input tokens * input rate)
                  + (output tokens * output rate)
                  + (cache writes * write rate)
                  - (cache reads * read discount)

Cost per month   =  Cost per action * Actions per month

Margin per user  =  Revenue per user
                  - (Cost per action * Actions per user per month)
                  - other COGS

Most of the work in a good cost model is estimating each of those numbers with some honesty. The formula is easy. The inputs are where teams lie to themselves.
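If you would rather start in code than a spreadsheet, the formula is a few lines of Python. A minimal sketch; the function names are ours, not any provider's SDK, and every rate is a placeholder until you plug in your provider's rate card:

    # Rates are dollars per token: the per-million rate card price / 1,000,000.
    def cost_per_action(input_tokens, output_tokens, cache_writes, cache_reads,
                        input_rate, output_rate, write_rate, read_discount):
        return (input_tokens * input_rate
                + output_tokens * output_rate
                + cache_writes * write_rate
                - cache_reads * read_discount)

    def cost_per_month(per_action, actions_per_month):
        return per_action * actions_per_month

    def margin_per_user(revenue, per_action, actions_per_user, other_cogs=0.0):
        return revenue - per_action * actions_per_user - other_cogs

Every step below is a different way of stressing the inputs to these three functions.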

Step 1: measure tokens, do not guess

One million tokens sounds like a lot. It is not.

A system prompt with guidelines, tool specs, and a handful of few-shot examples is often 4,000 to 12,000 tokens before the user asks anything. Retrieved documents for RAG can be another 5,000 to 30,000 tokens. A chain-of-thought response is typically 500 to 3,000 tokens. One “action” in a real product is rarely under 10,000 total tokens end to end, and often north of 40,000.

The correct way to size a feature is:

  1. Build the shortest working prompt you can.
  2. Run it against 20 representative inputs.
  3. Log the actual input tokens, output tokens, and cache behavior for each call.
  4. Take the 90th percentile, not the average. Your CFO cares about the bad month, not the median month.

If you have not done step 3 yet, your cost model is a guess. Fix that before arguing about rate cards.
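A sketch of that measurement loop, assuming a run_prompt() wrapper you write around your provider's SDK that returns per-call token usage (most SDKs report it on the response object; the dictionary keys here are assumptions):

    import statistics

    def size_feature(representative_inputs, run_prompt):
        # run_prompt is your own wrapper around the provider SDK; it should
        # return the token usage the API reports for each call.
        input_tokens, output_tokens = [], []
        for text in representative_inputs:
            usage = run_prompt(text)
            input_tokens.append(usage["input_tokens"])
            output_tokens.append(usage["output_tokens"])

        def p90(xs):
            # 90th percentile, not the mean: size for the bad month.
            return statistics.quantiles(xs, n=10)[8]

        return p90(input_tokens), p90(output_tokens)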

Step 2: pick a provider model, then run the sensitivity

Use the rate card your provider publishes. Anthropic, OpenAI, and the open-weight hosting providers all quote per-million-token prices that are easy to plug into the formula above.

What matters more than the exact number is sensitivity. Build your model so you can change one input (say, switch from a frontier model to a smaller one, or cut system prompt by 40 percent) and immediately see the effect on cost per action, monthly cost, and margin. Without that sensitivity, you are guessing about the wrong thing.
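A sensitivity pass can be as blunt as a loop over scenarios, reusing the cost formula above. The token counts and rates below are illustrative, not anyone's rate card:

    # P90 tokens and per-token rates; all numbers illustrative.
    scenarios = {
        "frontier model":  dict(tin=14_000, tout=800, rin=3.00e-6, rout=15.00e-6),
        "small model":     dict(tin=14_000, tout=800, rin=0.25e-6, rout=1.25e-6),
        "prompt cut 40%":  dict(tin=8_400,  tout=800, rin=3.00e-6, rout=15.00e-6),
    }
    for name, s in scenarios.items():
        per_action = s["tin"] * s["rin"] + s["tout"] * s["rout"]
        monthly = per_action * 1_000_000  # at 1M actions per month
        print(f"{name:16s} ${per_action:.4f}/action  ${monthly:,.0f}/month")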

Two rules that hold across providers:

  • Output tokens cost more than input tokens. Often 3x to 5x. Cutting a token of output saves more than cutting a token of input, so shorten responses before you shrink prompts.
  • Smaller models are cheaper than frontier models by 5x to 20x. For most subtasks (routing, classification, extraction, short summaries) a small model is the right default. Reserve the frontier model for the step that actually needs reasoning.

Step 3: model caching honestly

Prompt caching is the single biggest lever on AI cost. It is also the one teams most frequently overclaim.

The mechanic: the first call writes a cache entry at a premium (roughly 1.25x the input rate on Claude). Subsequent calls within the TTL read from cache at a discount (roughly 0.1x the input rate). At those rates the write premium pays for itself on the first cache hit: one write plus one read costs 1.35x the input rate, versus 2.0x for two uncached calls.

That math is beautiful if your cache actually gets hit. It does not always.

Cache keys in most providers are prefix-based. Change even one token at the start of the prompt and the entire cached prefix is invalidated. In practice that means:

  • System prompt, tool definitions, and static context: cache well. Same prefix every call.
  • Retrieved documents in RAG: cache poorly if documents differ per query, which is most of the time.
  • Long-running conversations: cache well as long as messages only append, never edit earlier turns.

Model both the hit rate and the miss cost. A 60 percent cache hit rate on a 20,000-token prefix saves real money. A 5 percent hit rate with a 1.25x write premium on every miss actively costs more than no cache at all.
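The honest version of that calculation is one line: the expected multiplier on your cacheable tokens, given the hit rate. Using the rough 1.25x write and 0.1x read figures quoted above:

    def effective_multiplier(hit_rate, write=1.25, read=0.10):
        # Expected cost of the cacheable prefix, as a multiple of the plain input rate.
        return (1 - hit_rate) * write + hit_rate * read

    print(effective_multiplier(0.60))  # 0.56 -> a 44 percent saving on the prefix
    print(effective_multiplier(0.05))  # 1.19 -> worse than not caching at all
    # Break-even: (1 - h) * 1.25 + h * 0.10 = 1.0  =>  h ~ 0.22

Below roughly a 22 percent hit rate, at these example rates, the cache is a cost, not a saving.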

Step 4: batch what you can, real-time what you must

Most providers offer a batch endpoint at roughly half the real-time price. It is one of the fastest ways to cut cost without changing the feature.

The tradeoff is latency: batch jobs may take minutes to hours to complete. For synchronous user-facing features (chat, inline suggestions, anything the user is watching a spinner for) batch is not an option. For everything else, it usually is.

A rough classification:

  • Must be real-time: chatbots, autocomplete, live translation, conversational interfaces, anything with <5 second UX budget.
  • Can be batched: nightly summaries, content moderation passes, bulk classification, backfills, evals, onboarding-time one-off enrichment.
  • Often overlooked batchable workloads: report generation delivered by email, daily digests, “weekly insights” panels, SEO content pipelines.

Many features that teams assume must be real-time become batchable the moment a PM is willing to say “results available within 10 minutes” instead of “instant.” That phrasing change often halves the monthly bill.
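The effect on the bill is a one-line sanity check: if a fraction of actions can move to a batch endpoint at roughly half price (the figure above; confirm against your provider's rate card), blended cost falls linearly with that fraction.

    def blended_monthly_cost(per_action, actions, batch_fraction, batch_rate=0.5):
        # batch_rate=0.5 assumes the common half-price batch discount.
        realtime = actions * (1 - batch_fraction) * per_action
        batched = actions * batch_fraction * per_action * batch_rate
        return realtime + batched

    # Moving 70 percent of a $10,000/month workload to batch:
    print(blended_monthly_cost(0.01, 1_000_000, 0.70))  # 6500.0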

Step 5: model the scaling curve, not the single-user cost

A cost-per-user number in isolation is not useful. What matters is the curve: how does cost per user change as your power users use the feature 10x more than your median user?

Build three tiers into your model:

  • Median user. Actions per month at the 50th percentile.
  • Power user. 90th percentile. Usually 5x to 15x the median.
  • Outlier. 99th percentile. Sometimes 50x to 200x. This is the person who accidentally, or deliberately, loops your feature in a script.

A pricing plan that is profitable on the median user and ruinous on the 99th percentile is not a pricing plan - it is an invitation for someone on Hacker News to post an “I got $8,000 of AI compute on the $29 plan” thread. Model the outlier. Decide in advance whether you cap, rate-limit, meter, or upsell them. Then price accordingly.

Step 6: the “should we build it?” check

After the model is built, four questions decide whether the feature is worth shipping at all:

  1. What is the cost per action at target model quality, with realistic caching? Not the aspirational minimum. The honest number.
  2. What is the action frequency per user per month? Measured or estimated from a comparable feature, not wishful.
  3. What is the incremental revenue this feature drives? Either directly (new plan tier, upsell) or indirectly (retention, conversion). Assign a dollar value.
  4. What is the margin, and does it survive the outlier? If the 99th-percentile user costs more than the median user pays, the answer is either rate-limit or reprice.

If margin is negative at 90th-percentile usage, the feature is not shippable at current model costs. Options: cheaper model, shorter prompts, more aggressive caching, batch the workload, reprice the plan, or do not build it. “Ship and hope” is not one of the options.
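Those four questions collapse into two inequalities. A sketch, with every argument a number you should already have measured:

    def ship_decision(revenue, per_action, actions_p90, actions_p99, other_cogs=0.0):
        margin_p90 = revenue - per_action * actions_p90 - other_cogs
        margin_p99 = revenue - per_action * actions_p99 - other_cogs
        if margin_p90 < 0:
            return "not shippable at current model costs"
        if margin_p99 < 0:
            return "shippable only with a rate limit, meter, or reprice"
        return "shippable"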

Worked example: document summarization feature

Hypothetical product. SaaS tool, $49/user/month. Adding a “summarize this document” button.

Inputs measured from prototype:

  • Average document: 8,000 tokens
  • System prompt: 1,500 tokens (cacheable)
  • Average output: 400 tokens
  • Median user: 12 summaries per month
  • Power user (P90): 80 summaries per month
  • Outlier (P99): 500 summaries per month

At a mid-tier model ($3 per million input, $15 per million output), no caching:

  • Cost per action: (9,500 * $3 + 400 * $15) / 1,000,000 = $0.0285 + $0.006 = $0.0345
  • Median user monthly cost: 12 * $0.0345 = $0.41. Healthy margin.
  • Power user monthly cost: 80 * $0.0345 = $2.76. Still fine.
  • Outlier monthly cost: 500 * $0.0345 = $17.25. Thin but positive on a $49 plan.

With the system prompt cached (60 percent hit rate):

  • Savings per cache hit: 1,500 tokens * ($3.00 - $0.30) / 1,000,000, about $0.004.
  • Expected saving per action, net of the 1.25x write premium on the 40 percent of misses: roughly $0.002.
  • Outlier drops to about $16.25/month. Meaningful, and half what a naive always-hit model would promise.

Risk: if the feature gets scripted (someone loops it against 10,000 documents) the outlier model breaks. Mitigation: hard rate-limit at 1,000 summaries per user per month, with a “contact us for high-volume” upsell path. Now the cost envelope is bounded.

Verdict: ship it. Ship it with the rate limit, not without.
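The same arithmetic as a script, so the numbers can be re-run the moment a rate or a token count changes:

    IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6    # mid-tier model, dollars per token
    DOC, SYSTEM, OUTPUT = 8_000, 1_500, 400  # measured from the prototype

    per_action = (DOC + SYSTEM) * IN_RATE + OUTPUT * OUT_RATE  # $0.0345
    for label, actions in [("median", 12), ("P90", 80), ("P99", 500)]:
        print(f"{label:7s} ${actions * per_action:6.2f}/month on a $49 plan")

    # System prompt cached: 60 percent of calls hit at 0.1x, misses pay 1.25x.
    hit, write, read = 0.60, 1.25, 0.10
    saving = SYSTEM * IN_RATE * (hit * (1 - read) - (1 - hit) * (write - 1))
    print(f"expected saving per action: ${saving:.4f}")  # ~$0.0020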

The spreadsheet

We maintain a cost model template (linked from samples when shipped) with the following columns: model name, input rate, output rate, cache write rate, cache read discount, measured input tokens, measured output tokens, cache hit rate, actions per median/P90/P99 user, revenue per user, resulting margin at each percentile.

The spreadsheet is less important than the discipline. A model built in any tool that forces you to fill in all those numbers is better than a beautifully formatted one that hides the assumptions.
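If the model lives in code rather than a spreadsheet, the discipline looks like this: a record type where nothing is defaulted and nothing computes until every field is filled in. One field is added beyond the column list above, cacheable_prefix_tokens, because the cache math needs it:

    from dataclasses import dataclass

    @dataclass
    class FeatureCostModel:
        model_name: str
        input_rate: float             # dollars per token
        output_rate: float            # dollars per token
        cache_write_rate: float       # dollars per token
        cache_read_discount: float    # dollars saved per cached token read
        input_tokens: int             # measured, P90
        output_tokens: int            # measured, P90
        cacheable_prefix_tokens: int  # the part of the input that is stable
        cache_hit_rate: float         # measured, not hoped
        actions_median: int
        actions_p90: int
        actions_p99: int
        revenue_per_user: float

        def cost_per_action(self) -> float:
            hit, miss = self.cache_hit_rate, 1 - self.cache_hit_rate
            cache_effect = self.cacheable_prefix_tokens * (
                hit * self.cache_read_discount
                - miss * (self.cache_write_rate - self.input_rate))
            return (self.input_tokens * self.input_rate
                    + self.output_tokens * self.output_rate
                    - cache_effect)

        def margin(self, actions_per_month: int) -> float:
            return self.revenue_per_user - self.cost_per_action() * actions_per_month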

Common cost-modeling mistakes

  • Using the quoted rate card. That is the list price. Always include caching, batch, and provider discounts in your model. Assume 20 to 40 percent below list for a serious workload.
  • Averaging when you should P90. “Average cost per action” hides the request that took 80,000 tokens because the user uploaded a 200-page contract.
  • Forgetting evals and dev costs. Running your eval harness nightly costs real tokens. So does the iteration loop during development. Budget them.
  • Modeling a single feature in isolation. If a user has access to three AI features, they use all three. Your cost model needs to sum across features per user, not per feature.
  • Not re-running the model when you change prompts. A prompt change that adds 400 tokens to every call, across a product doing 2M calls per month, is an $800 to $3,000 monthly bill. Re-run the model.

When to talk to us

We run cost-modeling exercises as part of most product-development engagements. If you are trying to decide whether a feature survives contact with a pricing plan, start a conversation. We will walk through the model with you and surface the assumptions that matter.

Related guide · Medium · 28 min

Evals: a primer

A cheaper model is only cheaper if quality holds. Evals tell you whether it does.

Related service · Engagement

Product development

We build AI features with the cost model attached, not as an afterthought.