The stage metaphor.
A 200k context window is not a memory card you fill up. It is a stage the model performs on. Everything on the stage is competing for the model's attention. Everything offstage is free.
The mental model of a context window as memory - the model "remembers" what is in the window - is misleading in a specific way. It implies that filling the window is costless, or that filling it more is always better than filling it less, because memory is a utility you maximize. The reality is different. The attention mechanism is a competitive process, and every token on the stage is competing with every other token for the model's attention. A sparse, precise context outperforms a dense one most of the time. A cluttered context degrades quality in ways that look like the model "missed" the important detail, when in fact the important detail was drowned out by twenty other detail-looking things.
The teams we advise most often discover this the hard way. They build a RAG pipeline, they increase the retrieval top-k from five to twenty to be safe, and they find that the model's answers get worse instead of better. The reason is not that the model cannot handle the tokens. The reason is that fifteen of the twenty retrieved chunks were off-topic but relevant-looking, and they pulled the model's attention away from the five that mattered. The fix is not a larger model. The fix is better retrieval that returns five precise chunks instead of twenty loose ones.
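To make that concrete, here is roughly what the fix looks like in code. This is a sketch, not a library: the `Chunk` shape, the 0.75 threshold, and the cap of five are illustrative, and the score is whatever your retriever or re-ranker emits.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # relevance score from the retriever or re-ranker, higher is better

def select_chunks(chunks: list[Chunk], max_chunks: int = 5,
                  min_score: float = 0.75) -> list[Chunk]:
    """Cap the chunk count and require a minimum relevance score,
    instead of padding the context 'to be safe'."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    return [c for c in ranked if c.score >= min_score][:max_chunks]
```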
Context engineering is the practice of treating the window as the stage: curating what goes on, arranging it with intent, and keeping everything else offstage where it cannot interfere.
Mistake 1: dumping the whole corpus.
"Just put the whole document set in the context, we have room." We see this proposed in every engagement where the team has heard that the model has a 200k window.
The reason not to do this is not that it will fail, exactly. It will often appear to work. The reason is that the answer quality is noticeably worse than it would be if the relevant five or ten thousand tokens had been retrieved and the rest had been left out. The degradation is subtle: the model gives a directionally correct answer but misses a nuance, or pulls the wrong quote from a nearby section, or confuses two similarly-phrased pieces of text from different parts of the corpus. These failure modes are hard to catch without an eval harness, which is why they persist in production.
Our rule: if you are putting more than thirty thousand tokens of document context in a prompt, you need a very good reason. "It is easier than writing retrieval" is not a good reason. "The document is actually one coherent long thing that the model needs to reason across holistically" is a good reason (book-length summary, full legal contract review, long code file diff). Most cases that feel like the second are actually the first in disguise.
Mistake 2: ignoring position.
Where something sits in the context window affects how the model treats it. This has been true of every large-context model we have worked with, and it is still true at 200k.
The details shift between model families, but two patterns have held. First, content near the top and near the bottom of the window gets slightly more attention than content in the middle. This is the "lost in the middle" phenomenon, and it is well documented. It means the most important instructions should not be buried in the middle of a long document. Second, instructions given after the material they apply to are often more effective than instructions given before, especially for transformations (rewrite, summarize, extract). This is somewhat counterintuitive but we see it consistently: "Here is the document. Now, rewrite it as a memo" outperforms "Rewrite this document as a memo. Here it is" for the same inputs.
The practical move is to structure every long-context prompt with intent about position. System instructions at the very top (they anchor the model's behavior). Core working material next. The specific task and constraints at the bottom, near where generation will start. Any retrieved chunks labeled with their purpose so the model knows why each one is on stage. Each of these is a small effect, but they compound.
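For transformation calls, the pattern is small enough to show inline. A sketch in our wording, not a magic phrasing:

```python
def memo_prompt(document: str) -> str:
    """Material first, instruction last, closest to where generation starts."""
    return (
        "Here is the document.\n\n"
        f"{document}\n\n"
        "Now, rewrite it as a memo."
    )
```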
Mistake 3: stateful chains that forget to truncate.
A multi-turn agent runs. Each turn appends the prior tool result to the context. Nothing ever gets removed. Within ten turns, the context is ninety percent old tool results, most of which are irrelevant to the current step.
This is the single most common failure mode we see in production agent systems. The agent was working fine at turn three. It started making obviously bad decisions at turn twelve. The team blames the model for "losing the thread". The real cause is that the thread has been drowned in stale tool output. The agent is spending its attention on a database query result from eight steps ago that nobody needs anymore.
The fix is explicit context management as a first-class part of the agent loop. Every turn, before the next LLM call, the loop decides what to keep and what to drop. Keep the original task. Keep the most recent tool result. Keep any result explicitly marked as a decision or a finding. Drop everything else, or summarize it into a one-line note. This is uncomfortable work because it forces the team to decide what "matters" in an agent's working memory, but the alternative is agents that quietly degrade after ten turns and nobody knows why.
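Here is a sketch of that per-turn decision, assuming the agent loop tags each context entry with a kind. The tag names and the crude truncation are ours; in a real pipeline the one-line note would come from a cheap summarization call.

```python
def prune(history: list[dict]) -> list[dict]:
    """One pruning pass per turn, before the next model call.

    Assumes every entry is tagged with a 'kind': 'task', 'finding',
    'tool_result', or 'note'.
    """
    tool_results = [m for m in history if m["kind"] == "tool_result"]
    latest_tool = tool_results[-1] if tool_results else None

    kept = []
    for msg in history:
        if msg["kind"] in ("task", "finding", "note"):
            kept.append(msg)      # the original task, findings, and prior notes stay
        elif msg is latest_tool:
            kept.append(msg)      # only the most recent tool result survives intact
        else:
            # older tool results shrink to a one-line note
            kept.append({"kind": "note",
                         "content": msg["content"].replace("\n", " ")[:120]})
    return kept
```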
The extension of this pattern is conversation summarization. For a client-support agent that runs for hours, we summarize the prior hour of conversation into a two-hundred-token state-of-the-conversation note and drop the raw history. The note captures decisions made, open questions, and the user's stated goal. Two hundred tokens of the right summary is worth ten thousand tokens of raw transcript.
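A sketch of the roll-up, with the summarizer left as a stand-in for a cheap model call; the trigger and the exact instruction are illustrative.

```python
STATE_NOTE_INSTRUCTION = (
    "Summarize this conversation in under 200 tokens. Capture the decisions "
    "made, the open questions, and the user's stated goal."
)

def roll_up(turns: list[str], summarize) -> list[str]:
    """Replace raw history with a single state-of-the-conversation note.

    `summarize(instruction, text)` stands in for a cheap LLM call; the
    50-turn trigger is illustrative - elapsed time or token count work too.
    """
    if len(turns) < 50:
        return turns
    note = summarize(STATE_NOTE_INSTRUCTION, "\n".join(turns))
    return [f"[CONVERSATION STATE]\n{note}"]
```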
Mistake 4: not separating system from working memory.
The system prompt is stable across calls. The working memory changes every call. When they are mixed in the same block, the model has trouble telling what is policy and what is data.
In our pipelines, every prompt has a clear partition. The system section is declarative, stable, and describes the model's role, tone, constraints, and output format. It changes when the pipeline changes, not per call. The working memory section is everything that is specific to this call: the retrieved context, the user's input, the tool results so far. We label the boundary explicitly with a header. We do not shuffle instructions into the working memory.
This matters for two reasons. First, it lets us cache the system section across calls (the prompt caching mechanism rewards stable prefixes), which is meaningful for cost. Second, it prevents a subtle bug: when a user input accidentally looks like an instruction, the model is more likely to treat it as data if the data section is cleanly bounded. This is not a security layer - prompt injection is its own problem - but it is a hygiene measure that reduces a specific class of bug.
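In code, the partition looks roughly like this. The sketch assumes an Anthropic-style Messages API, where the system parameter takes content blocks and a cache_control marker flags the stable prefix for caching; swap in whatever your provider's equivalent is.

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."  # role, tone, constraints, output format: changes with the pipeline, not per call

SYSTEM_BLOCKS = [{
    "type": "text",
    "text": SYSTEM_PROMPT,
    "cache_control": {"type": "ephemeral"},  # mark the stable prefix as cacheable
}]

def generate(working_memory: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # substitute the model you actually deploy
        max_tokens=2048,
        system=SYSTEM_BLOCKS,              # policy lives here, never per-call data
        messages=[{"role": "user",
                   "content": f"[WORKING MEMORY]\n{working_memory}"}],
    )
    return response.content[0].text
```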
How we actually assemble a context.
Here is the rough shape of a long-context prompt we would ship for a handbook-generation call. Sizes are rough because they depend on the engagement, but the proportions hold.
[system prompt ~2k tokens - stable, cached]
role, voice guide, output format,
hard constraints, style exemplars
[client context ~4k tokens - stable per engagement, cached]
intake summary, goals, prior-section summaries
[retrieved chunks ~8-15k tokens - varies]
exactly the 5-8 chunks relevant to this section
each one labeled: [SOURCE: X, RELEVANCE: Y]
[prior-section summary ~1k tokens]
what the handbook has already established
so this section does not contradict
[task ~500 tokens]
the specific thing to produce right now,
with format, length, and any per-call overrides
[scratch ~0-2k tokens]
tool results from this turn, if any
Total is usually under thirty thousand tokens for a section generation, even though the model supports six-plus times that. We do not use the extra capacity because we do not need it, and every token we do not put on the stage is a token not competing for attention.
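For completeness, a minimal assembler for that layout. The bracketed headers are our own labels, and in the real pipeline the system block rides in the API's system slot so it can be cached; it is inlined here for legibility.

```python
def build_section_prompt(system: str, client_context: str,
                         chunks: list[dict], prior_summary: str,
                         task: str, scratch: str = "") -> str:
    """Assemble the handbook-section context in the order sketched above."""
    retrieved = "\n\n".join(
        f"[SOURCE: {c['source']}, RELEVANCE: {c['relevance']}]\n{c['text']}"
        for c in chunks  # exactly the 5-8 chunks relevant to this section
    )
    parts = [
        f"[SYSTEM]\n{system}",                  # ~2k tokens, stable, cached
        f"[CLIENT CONTEXT]\n{client_context}",  # ~4k tokens, stable per engagement
        f"[RETRIEVED]\n{retrieved}",            # ~8-15k tokens, varies per section
        f"[PRIOR SECTIONS]\n{prior_summary}",   # ~1k tokens, keeps sections consistent
        f"[TASK]\n{task}",                      # ~500 tokens, last, nearest generation
    ]
    if scratch:
        parts.append(f"[SCRATCH]\n{scratch}")   # tool results from this turn, if any
    return "\n\n".join(parts)
```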
The discipline is the thing. The 200k window is a tool that lets us carry more state when state matters, not an invitation to stop thinking about what belongs on the stage. The teams that get the most out of large windows are the ones that treat them most carefully, not the ones that fill them most aggressively.