The context ladder: what your agent should always see, sometimes see, and only fetch on demand

When an agent gives a wrong answer, the instinct is to blame the model. Usually the model was fine. It just never saw the fact it needed — or it saw so much irrelevant detail that the relevant part got lost.

That’s a context problem, not an intelligence problem. And context is something you design, not something you hope for. After enough “why didn’t it know that?” bugs, I started treating context as a layered system with explicit rules, the way you’d treat a cache hierarchy. Here’s the model that came out of it.

The ladder

Every piece of data an agent might use sits on a ladder, from cheap and always-present at the bottom to expensive and fetched-on-demand at the top. The rung a piece of data belongs on is decided by two questions: how fresh must it be, and does the model need it every turn or only sometimes?

A rough version of the ladder I use:

Identity & policy. Who the agent is, what it’s allowed to do, its tone and refusal posture. Immutable between deployments, baked into the system prompt, always present. Because it never changes within a deployment, this is the part of the prompt that should be cacheable as a prefix.
User profile. Who the agent is talking to and the stable facts about them. Present on every turn once the user is known, but sourced from a database under a TTL — not hardcoded.
Session signals. The narrow slice of profile data that can change during a conversation: a subscription that activates mid-session, a device that connects while the user is typing. Same storage as the profile, shorter refresh window.
Conversation history. The current transcript, trimmed to a recent window. Always present, naturally bounded.
Per-turn scratch. Routing decisions, one-time flags, classifier output. Lives for a single request and should be discarded after — not leaked into the next turn.
Tool-retrieved data. Anything large, sensitive, or expensive: live business data, search results, detailed activity. Never preloaded. The model has to ask for it, and each request has to state an intent that justifies the fetch.
Long-term memory. Durable facts the user told you in past conversations, retrieved by semantic search and only when the user is explicitly recalling something.

A well-instrumented agent answers most questions without ever reaching the top two rungs. Those exist for the genuinely expensive and the genuinely private.
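One way to make the ladder concrete is to register every field against a rung. The sketch below is illustrative only: the Rung enum mirrors the list above, while the Field class, the preloaded helper, and all field names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rung(IntEnum):
    """Rungs ordered from bottom (cheap, always present) to top (fetched on demand)."""
    IDENTITY_POLICY = 0
    USER_PROFILE = 1
    SESSION_SIGNALS = 2
    CONVERSATION_HISTORY = 3
    PER_TURN_SCRATCH = 4
    TOOL_RETRIEVED = 5
    LONG_TERM_MEMORY = 6


# The top two rungs are never preloaded; the model has to ask for them.
ON_DEMAND = {Rung.TOOL_RETRIEVED, Rung.LONG_TERM_MEMORY}


@dataclass(frozen=True)
class Field:
    name: str  # hypothetical field name, for illustration
    rung: Rung


def preloaded(fields: list[Field]) -> list[str]:
    """Names of the fields that reach the prompt without a tool or memory call."""
    return [f.name for f in fields if f.rung not in ON_DEMAND]


FIELDS = [
    Field("tone_and_refusals", Rung.IDENTITY_POLICY),
    Field("display_name", Rung.USER_PROFILE),
    Field("subscription_active", Rung.SESSION_SIGNALS),
    Field("order_history", Rung.TOOL_RETRIEVED),
    Field("past_preferences", Rung.LONG_TERM_MEMORY),
]

print(preloaded(FIELDS))
```

Keeping the registry in one place means the question "which rung is this field on?" has exactly one answer you can look up.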

Why the rungs matter

The temptation is to shove everything into the system prompt “just in case.” It feels safe. It is not.

Every always-present field costs tokens on every single turn, in every single conversation. A field that adds 80 tokens and is relevant 5% of the time is pure tax the other 95%. Worse, a bloated prompt doesn’t just cost money — it degrades reasoning. The model has to find the signal in everything you’ve handed it, and the more you hand it, the more often it grabs the wrong thing.
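To put numbers on that tax, here is a small worked calculation. The 80 tokens and 5% come from the example above; the 1,000-turn volume and the function name are assumptions chosen only to make the arithmetic visible.

```python
def always_present_cost(field_tokens: int, relevant_pct: int, turns: int) -> tuple[int, int]:
    """Return (total_tokens, wasted_tokens) for one always-present field.

    wasted_tokens counts the tokens spent on turns where the field was not relevant.
    """
    total = field_tokens * turns
    wasted = total * (100 - relevant_pct) // 100
    return total, wasted


# An 80-token field, relevant on 5% of turns, over a hypothetical 1,000 turns.
total, wasted = always_present_cost(field_tokens=80, relevant_pct=5, turns=1000)
print(total, wasted)  # 80000 76000
```

That is 76,000 tokens bought for no benefit from a single field, before counting any effect on answer quality.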

So the discipline is the opposite of “just in case.” A new always-present field has to earn its rung. If it’s only relevant under a condition, it gets compiled in conditionally. If it’s only relevant when the user asks, it moves up to a tool. The default answer to “should the agent always see this?” is no.
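A minimal sketch of what "compiled in conditionally" could look like, assuming a compile step that assembles the prompt from a list of fields each turn. PromptField, compile_prompt, and the example fields are all hypothetical names, not a reference to any particular framework.

```python
from dataclasses import dataclass
from typing import Callable

Context = dict  # hypothetical: whatever per-turn state the compile step can see


@dataclass(frozen=True)
class PromptField:
    name: str
    render: Callable[[Context], str]
    # Condition under which the field earns its place; the default is "always".
    include_if: Callable[[Context], bool] = lambda ctx: True


def compile_prompt(fields: list[PromptField], ctx: Context) -> str:
    """Render only the fields whose condition holds for this turn."""
    return "\n".join(f.render(ctx) for f in fields if f.include_if(ctx))


FIELDS = [
    PromptField("name", lambda c: f"User: {c['name']}"),
    PromptField(
        "trial",
        lambda c: f"Trial ends in {c['trial_days_left']} days",
        include_if=lambda c: c.get("trial_days_left") is not None,
    ),
]

print(compile_prompt(FIELDS, {"name": "Ada"}))
print(compile_prompt(FIELDS, {"name": "Ada", "trial_days_left": 3}))
```

The trial line costs tokens only for users who are actually on a trial; everyone else never pays for it.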

Freshness is a contract, not a vibe

The other half of the ladder is freshness. Each field is assigned one refresh behaviour from a small, fixed set, not a bespoke value per field, because bespoke freshness rules are how you end up with subtle staleness bugs nobody can reproduce.

The pairing that matters most: event-driven plus a short TTL, together. The events give you near-real-time correctness when something actually happens — a webhook fires, a state flips. The TTL bounds your wrongness when the event bus goes quiet. Either one alone has a failure mode. Together they’re robust.
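The event-plus-TTL pairing can be sketched as a small cache wrapper. This assumes a loader function that reads the source of truth and an event handler that can call on_event; the class name, the method names, and the injectable clock are all illustrative choices, not a prescribed design.

```python
import time
from typing import Any, Callable, Optional


class FreshField:
    """Cached value refreshed by events, with a TTL as the backstop."""

    def __init__(
        self,
        load: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._loaded_at: Optional[float] = None  # None means "must reload"

    def on_event(self) -> None:
        """Call from the event handler (e.g. a webhook): drop the cached value."""
        self._loaded_at = None

    def get(self) -> Any:
        """Return the cached value, reloading if invalidated or older than the TTL."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._value = self._load()
            self._loaded_at = now
        return self._value
```

If an event is dropped, the value is wrong for at most ttl_seconds; if events arrive normally, the TTL rarely does any work. The clock parameter exists so the expiry logic can be tested without sleeping.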

The single most common production bug in this space is the agent confidently asserting stale state: “you have three children” right after the user added a fourth. That’s almost always a freshness contract that was never written down.

The triage trick

The real payoff of an explicit ladder shows up when something breaks. A user says “the agent didn’t know X.” Instead of guessing, you ask: which rung should X have been on?

If X belongs on an always-present rung and wasn’t there, you have an inclusion or freshness bug — go look at the compile step.
If X was only ever reachable through a tool and the model didn’t call the tool, you have a routing problem — a different fix entirely.
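The same triage can be written down as a tiny decision function. This is a sketch under assumptions: the three boolean inputs are things you would have to read off your own logs, "in the prompt" is taken to mean present and current, and the third outcome (the fact was available, so look elsewhere) is an addition for completeness rather than part of the two cases above.

```python
from enum import Enum


class Defect(Enum):
    INCLUSION_OR_FRESHNESS = "look at the compile step"
    ROUTING = "look at why the tool was not called"
    NOT_A_CONTEXT_BUG = "the fact was available; look elsewhere"


def triage(always_present_rung: bool, was_in_prompt: bool, tool_was_called: bool) -> Defect:
    """Map 'the agent didn't know X' onto a locatable defect."""
    if always_present_rung:
        # X should have been compiled in, current, on every turn.
        return Defect.NOT_A_CONTEXT_BUG if was_in_prompt else Defect.INCLUSION_OR_FRESHNESS
    # X was only reachable through a tool.
    return Defect.NOT_A_CONTEXT_BUG if tool_was_called else Defect.ROUTING


print(triage(always_present_rung=True, was_in_prompt=False, tool_was_called=False).value)
```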

That one question turns a vague “the AI is dumb” complaint into a specific, locatable defect. Which is the whole point of treating context as a system: not to make it elegant, but to make it debuggable.

The model is rarely the bottleneck. What you chose to put in front of it almost always is.