The Cloudwrit AI Product Handbook,
written so your feature survives its next model upgrade.

An eighty-six page account of one live AI feature (Summarize), what version two needs to ship safely, and how the cost curve flattens when caching and evals are treated as first-class. Twenty-one findings. Four critical. One firm opinion about prompt ownership.

Sanitized sample · Not a real engagement

What you are about to read

01The Summarize feature today, in one pagep. 04
02Four evals to set up before you ship v2p. 09
03Findings: quality, cost, latency, safetyp. 18
04How Summarize actually works end to endp. 38
05The cost model, with caching and batchp. 52
06The six month roadmap, sequencedp. 68
07Appendix: prompt versions, eval runs, decisionsp. 78

Chapter 01 The Summarize feature today, in one page.

Summarize went live in September 2025. Your users generated 2.3 million summaries last quarter. Spend is $12,400 a month on Claude tokens, trending up eighteen percent month over month, tracking usage closely. Quality is self-reported “good” by eighty-three percent of users in-product, which sounds healthy until you look at the twelve-case eval suite (too small), the absence of a regression test on model releases (risky), and the thirty-eight percent of spend that is the exact same structural prefix being resent on every request (avoidable).

The thesis of this handbook is that Summarize is a real feature doing real work, and the next six months are the difference between shipping version two safely and waking up to a regression after the next model rev. The fixes are not radical. They are the standard operating habits of a team that ships AI features for a living. We will put them in place and leave the eval harness running.

North star By Q4 2026, every prompt change to Summarize should ship behind an eval gate with at least one hundred golden cases, a structural-prefix cache should be reducing spend by thirty to forty percent, p99 latency should be under five seconds, and a fallback path should exist for the three named outage modes.

We wrote this for three readers. A. Kim, who owns the AI product line day to day. J. Patel, who approves the infra spend. And the applied AI engineer you will hire in Q2, whose first week should be reading this document and running the harness without anyone else's help.

Chapter 02 Four evals to set up before you ship v2.

We ran a ten day engagement across your Summarize codebase, your CI config, your observability stack, and one month of user telemetry. Twenty-one findings. Four require action before the v2 release window you have targeted for late Q2. The rest are sequenced in chapter six.

F-001

No regression test on model upgrade, silent quality drop on last release

When Anthropic shipped claude-sonnet-4-6 in March, Summarize quality on our later-reconstructed eval dropped eleven percent on technical documents. Your team did not notice for twelve days, and then only from a support ticket. A regression gate with one hundred cases would have flagged it the same hour the model switched. Fix: golden dataset in appendix, scorer specified, CI job drafted.

Critical

F-002

Prompt-injection review is missing from the ingestion path

Summaries are generated from user-submitted documents. A document authored by a third party (collaborator, customer, public share) can embed instructions that your current prompt does not defuse. We wrote a proof of concept that caused Summarize to respond with injected text instead of a summary. See Appendix C.

Critical

F-003

No cache on structural prefix, thirty-eight percent of spend is duplicated tokens

Your prompt template is two thousand tokens: system prompt, style guide, example transcripts. The variable part is the user document. Every request resends the whole prefix. Enabling Anthropic prompt caching on the prefix cuts those two thousand tokens to fifty per request. Measured on one week of traffic, estimated savings of $4,700 / month. Fix is a header and a cache boundary.

High

F-004

p99 latency at 12 seconds, no timeout, no fallback

On long technical PDFs, generation can run 12+ seconds. There is no timeout, no streaming, no “still working” UI state, and no shorter-context fallback. When a user closes the tab at 8 seconds, your code retries twice on the server, silently paying for compute the user will not see.

High

All four are process gaps, not skill gaps. Your team knew each of these was a risk; the quarter got away from you, as quarters do. Chapter 05 defines the process discipline. Chapter 06 sequences the fixes alongside the new cost model so the team ships the savings, not just the feature.

Chapter 03 Findings: quality, cost, latency, safety.

The full chapter reproduces all twenty-one findings, grouped by axis, with a reproduction command, the before metric, the proposed fix (inline), and the after metric we project. Appendix D carries the transcript of Claude's reasoning on each finding so your team can audit or extend.

A fragment follows. The full chapter is omitted from the sample.

Figure 3.1 · Findings by axis and severity

Quality

Cost

Latency

Safety

Critical

High

Medium

Low

Cost findings are spread Medium to High, which is a pattern we see often: cost problems rarely reach Critical because they are legible to finance and raised earlier. Quality issues concentrate High because they are only caught by evals, and the current eval surface is narrow. Safety has one Critical (injection) and one Medium, which is the distribution you would expect from a team that built the guardrails by intuition.

Chapter 04 How Summarize actually works end to end.

Four people on your team can explain the full path. By the end of this chapter, any engineer can. The chapter walks the single-request flow (user click to summary render) in twenty-two pages, with timing at every hop, and then narrates the three variants (long document, collaborative document, export-to-PDF) each in four pages. Every time Claude or the API surfaces a failure mode, the chapter names it and says where the fallback should live. Nothing in this chapter is aspirational.

The request boundary: what the client sends, what the server adds, what the model receives.
The four known failure modes, their frequency in your traffic, and the fallback each triggers.
The three places caching is or should be applied (prompt prefix, user document hash, final summary).
The observability surface: what is logged, what is not, what should be.
Where v2 will change each of the above, with the expected before / after metric.

Chapter 05 The cost model, with caching and batch.

This chapter is the one your finance partner will want. It opens with a spreadsheet (reproduced inline) of current per-summary cost, projected per-summary cost after caching, projected per-summary cost with batch for asynchronous bulk summarization, and the sensitivity of each to input-document length. The chapter ends with a short answer to the question finance keeps asking: at current growth, when does Summarize stop being a line item and start being a margin question? The answer under our assumptions is Q2 2027, and the chapter shows the three levers that can push that back by two quarters.

Chapter 06 The six month roadmap, sequenced.

All twenty-one findings plus four platform initiatives (eval harness, prefix cache, fallback paths, cost telemetry) sequenced week by week, assuming flat AI engineering headcount through Q2 and +1 in Q3. Every item has an owner, an estimate, a definition of done, and a reviewer. The roadmap is drafted so you can swap any one item without breaking the chain, and the chapter explains which swaps are free and which cost you a week elsewhere.

Opinion, clearly marked as such We would ship F-003 (prefix cache) before F-001 (regression eval) even though F-001 is the more important capability. Reason: F-003 takes two days, saves $4,700 / month, and funds the harness work. Shipping a cost win first buys political room for the eval work. The cost of getting this wrong is one round of re-sequencing, not a regression. The chapter explains how to flip it if your team disagrees.

The remaining pages continue in this register: plain, sequenced, specific. Appendices include the golden dataset starter, Claude's full reasoning transcripts on every finding, an annotated prompt-version history, and a decision log that starts today and keeps a running record of every AI-product choice you make for as long as this handbook is live.

End of sample The remaining sixty pages cover the full findings list, the cost modeling chapter, the observability build, and appendices. If this is the kind of artifact you want for your own AI feature - not a post-mortem after v2, but a document your team uses to build v2 - we would like to write one for you.

An AI feature your users keep using.
Not a demo they stopped loading.

The Cloudwrit AI Product Handbook,
written so your feature survives its next model upgrade.

Chapter 01 The Summarize feature today, in one page.

Chapter 02 Four evals to set up before you ship v2.

No regression test on model upgrade, silent quality drop on last release

Prompt-injection review is missing from the ingestion path

No cache on structural prefix, thirty-eight percent of spend is duplicated tokens

p99 latency at 12 seconds, no timeout, no fallback

Chapter 03 Findings: quality, cost, latency, safety.

Chapter 04 How Summarize actually works end to end.

Chapter 05 The cost model, with caching and batch.

Chapter 06 The six month roadmap, sequenced.

The same artifact, four more ways.

Security handbook

Web engineering handbook

SEO / GEO handbook

Growth handbook

Commission an AI product handbook for your feature.

An AI feature your users keep using. Not a demo they stopped loading.

The Cloudwrit AI Product Handbook,written so your feature survives its next model upgrade.

Chapter 01 The Summarize feature today, in one page.

Chapter 02 Four evals to set up before you ship v2.

No regression test on model upgrade, silent quality drop on last release

Prompt-injection review is missing from the ingestion path

No cache on structural prefix, thirty-eight percent of spend is duplicated tokens

p99 latency at 12 seconds, no timeout, no fallback

Chapter 03 Findings: quality, cost, latency, safety.

Chapter 04 How Summarize actually works end to end.

Chapter 05 The cost model, with caching and batch.

Chapter 06 The six month roadmap, sequenced.

The same artifact, four more ways.

Security handbook

Web engineering handbook

SEO / GEO handbook

Growth handbook

Commission an AI product handbook for your feature.

An AI feature your users keep using.
Not a demo they stopped loading.

The Cloudwrit AI Product Handbook,
written so your feature survives its next model upgrade.