Context Engineering: Why Your AI Agent Needs Less, Not More

Anthropic's guide to context engineering reveals why LLMs degrade with bloated context windows — and how to build agents that use compaction, just-in-time retrieval, and structured memory to stay focused across long-horizon tasks.

TL;DR

  • Context engineering is replacing prompt engineering as the core discipline for building AI agents
  • LLMs suffer from "context rot" — their accuracy degrades as context windows fill up, even with 200K+ token limits
  • The goal isn't maximizing context, it's finding the smallest set of high-signal tokens that drive desired behavior
  • Anthropic's techniques: compaction, structured note-taking, sub-agent architectures, and just-in-time retrieval
  • Essential for anyone building multi-turn agents or long-horizon tasks

The Big Picture

Prompt engineering is dead. Long live context engineering.

After years of obsessing over the perfect system prompt, the AI engineering community is waking up to a harder problem: managing the entire state of information an LLM sees during inference. Anthropic calls this "context engineering" — the art of curating what lands in your model's attention budget across multiple turns of inference.

This isn't just semantic rebranding. The shift reflects a fundamental change in how we're building with LLMs. Early use cases were mostly one-shot tasks: classify this text, generate that response, done. Now we're building agents that loop for minutes or hours, accumulating tool outputs, message history, file contents, and search results until the context window becomes a junk drawer of potentially relevant information.

The problem? LLMs don't handle bloated context well. Research on "context rot" shows that as token count increases, model accuracy decreases — even within the official context window limit. Anthropic's position is clear: context is a finite resource with diminishing marginal returns. Every token you add depletes the model's attention budget.

This matters because the agents we're building today aren't just chatbots. They're autonomous systems that use tools in loops, navigate codebases, conduct research, and operate over extended time horizons. If you're building anything beyond a simple RAG pipeline, you need to think in context.

How It Works

Context engineering starts with understanding why LLMs struggle with large contexts. The transformer architecture enables every token to attend to every other token, creating n² pairwise relationships for n tokens. As context length increases, the model's ability to capture these relationships gets stretched thin.
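
A back-of-the-envelope illustration of that quadratic cost: growing the context 100x multiplies the number of pairwise relationships by 10,000x.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of pairwise token relationships attention must model (n^2)."""
    return n_tokens * n_tokens

# Going from a 2K-token prompt to a 200K-token one is a 100x increase in
# tokens, but a 10,000x increase in pairwise relationships.
growth = attention_pairs(200_000) // attention_pairs(2_000)
```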

There's also a training data problem. Models see far more short sequences than long ones during training, so they have less experience with context-wide dependencies. Techniques like position encoding interpolation let models handle longer sequences by mapping them onto the position range seen during training, but at some cost to precision when retrieving information from those longer contexts.

The result isn't a hard cliff — it's a performance gradient. Models remain capable at longer contexts but show reduced precision compared to their performance on shorter ones. This creates the core tension: you want to give your agent enough information to succeed, but every additional token risks diluting its attention.

Anthropic's framework breaks context into components: system prompts, tools, examples, message history, and retrieved data. Each component needs aggressive curation.

System prompts should hit the "right altitude" — specific enough to guide behavior, flexible enough to avoid brittle if-else logic. Anthropic recommends organizing prompts into distinct sections using XML tags or Markdown headers, but emphasizes that formatting matters less than finding the minimal set of information that fully outlines expected behavior. Start with a minimal prompt on the best available model, then add instructions based on observed failure modes.
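
As a sketch of what "distinct sections" can look like, here is a minimal system prompt skeleton. The section names and wording are illustrative, not Anthropic's own:

```python
# A minimal system prompt organized into distinct XML-tagged sections.
# Section names and contents are hypothetical examples.
SYSTEM_PROMPT = """\
<role>
You are a coding assistant that edits files in a Python repository.
</role>

<instructions>
- Prefer small, reviewable changes.
- Run the test suite before declaring a task complete.
</instructions>

<output_format>
Reply with a short plan, then the diff.
</output_format>
"""

def build_messages(user_request: str) -> list[dict]:
    """Assemble the message list sent on an inference turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ]
```

The point is the structure, not the tags: the same sections could be Markdown headers, as long as the minimal set of expectations is fully outlined.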

Tools define the contract between agents and their environment. Bloated tool sets are a common failure mode. If a human engineer can't definitively say which tool to use in a given situation, an AI agent won't do better. Tools should be self-contained, robust to error, and extremely clear about their intended use. Input parameters should be descriptive and unambiguous.
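
A hypothetical tool definition showing what that guidance looks like in practice: a clear statement of when to use (and not use) the tool, plus descriptive, unambiguous parameters. The tool name and schema shape here follow a common JSON-schema style, but everything about this particular tool is invented for illustration:

```python
# A hypothetical, self-contained tool definition. The description says
# exactly when the tool applies; each parameter is named and documented
# unambiguously.
SEARCH_NOTES_TOOL = {
    "name": "search_notes",
    "description": (
        "Search the project's NOTES directory for a keyword. "
        "Use this when the user asks about past decisions; do NOT use it "
        "for source-code search."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Exact keyword or phrase to search for.",
            },
            "max_results": {
                "type": "integer",
                "description": "Upper bound on matches returned (1-20).",
            },
        },
        "required": ["query"],
    },
}
```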

Examples remain critical, but Anthropic warns against stuffing edge cases into prompts. Instead, curate a diverse set of canonical examples that portray expected behavior. For an LLM, examples are the pictures worth a thousand words.

Just-in-time retrieval is where context engineering diverges most sharply from traditional RAG. Rather than pre-processing all relevant data upfront, agents maintain lightweight identifiers — file paths, stored queries, web links — and use tools to dynamically load data at runtime. Claude Code uses this approach to analyze large databases without ever loading full data objects into context. It writes targeted queries, stores results, and leverages Bash commands like head and tail to work with data incrementally.
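
The pattern can be sketched as keeping only cheap identifiers in context and loading bytes on demand. This is a minimal illustration, not Claude Code's implementation:

```python
from itertools import islice
from pathlib import Path

class FileIndex:
    """Just-in-time retrieval sketch: the agent's context holds only file
    paths (lightweight identifiers); contents load when a tool is called."""

    def __init__(self, root: str):
        self.root = Path(root)

    def list_paths(self, pattern: str = "*.py") -> list[str]:
        # Cheap identifiers kept in context instead of full file contents.
        return sorted(str(p) for p in self.root.rglob(pattern))

    def read_head(self, path: str, n_lines: int = 20) -> str:
        # Load only the first n_lines at the moment the agent needs them,
        # mirroring incremental inspection with tools like `head`.
        with open(path, encoding="utf-8") as f:
            return "".join(islice(f, n_lines))
```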

This mirrors human cognition. We don't memorize entire corpora — we build indexing systems like file hierarchies and bookmarks to retrieve information on demand. The metadata of these references provides important signals: a file named test_utils.py in a tests folder implies a different purpose than the same filename in src/core_logic/.

The trade-off is speed. Runtime exploration is slower than retrieving pre-computed data. Anthropic suggests a hybrid strategy: retrieve some data upfront for speed, enable autonomous exploration for the rest. The decision boundary depends on the task. Claude Code drops CLAUDE.md files into context upfront while using glob and grep for just-in-time navigation.

What This Changes For Developers

If you're building agents, context engineering changes your entire workflow. You're no longer just writing prompts — you're designing information architectures that evolve across dozens or hundreds of inference turns.

The most immediate impact is on long-horizon tasks. Agents that run for tens of minutes or hours will hit context window limits no matter how large those windows get. Anthropic uses three techniques to handle this:

Compaction summarizes conversation history when approaching context limits, then reinitiates with the compressed version. In Claude Code, the model preserves architectural decisions, unresolved bugs, and implementation details while discarding redundant tool outputs. The art is in selection — overly aggressive compaction loses subtle but critical context. Anthropic recommends tuning compaction prompts on complex agent traces, starting with maximum recall then iterating to improve precision. The lightest-touch form is tool result clearing, which removes raw tool outputs once they're no longer needed.
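
A minimal compaction loop might look like the following. The token estimate is a rough characters-per-token heuristic, and the summarizer is a placeholder where a real system would make an LLM call:

```python
def estimate_tokens(messages: list[dict]) -> int:
    # Rough heuristic: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages: list[dict], budget: int, keep_recent: int = 4) -> list[dict]:
    """When the estimated token count crosses the budget, replace older
    turns with a summary stub and reinitiate with recent turns intact."""
    if estimate_tokens(messages) <= budget:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Placeholder summarizer: a real implementation would prompt the model
    # to preserve decisions, unresolved bugs, and key implementation details.
    summary = "Summary of earlier turns: " + "; ".join(
        m["content"][:40] for m in old if m["role"] == "assistant"
    )
    return [{"role": "system", "content": summary}] + recent
```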

Structured note-taking gives agents persistent memory outside the context window. The agent writes notes to a file or memory system, then pulls them back in later. Claude playing Pokémon on Twitch demonstrates this: the agent maintains precise tallies across thousands of game steps, tracking objectives like "for the last 1,234 steps I've been training my Pokémon in Route 1, Pikachu has gained 8 levels toward the target of 10." After context resets, it reads its own notes and continues multi-hour training sequences. Anthropic recently released a memory tool in beta that makes this easier through a file-based system.
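
A file-backed memory can be sketched in a few lines. This mimics the shape of the pattern, not Anthropic's beta memory tool:

```python
from pathlib import Path

class AgentMemory:
    """Structured note-taking sketch: the agent appends progress notes to a
    file outside the context window and reloads them after a reset."""

    def __init__(self, path: str = "NOTES.md"):
        self.path = Path(path)

    def append(self, note: str) -> None:
        # Persist a note; survives compaction and context resets.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(f"- {note}\n")

    def reload(self) -> str:
        # Called after a context reset to restore working state.
        return self.path.read_text(encoding="utf-8") if self.path.exists() else ""
```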

Sub-agent architectures split work across specialized agents with clean context windows. The main agent coordinates with a high-level plan while subagents handle focused tasks. Each subagent might use tens of thousands of tokens but returns only a condensed summary of 1,000-2,000 tokens. Anthropic's multi-agent research system showed substantial improvement over single-agent systems on complex research tasks using this pattern.
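
The coordination pattern reduces to this shape, where run_subagent is a stand-in for spinning up a fresh conversation with its own clean context:

```python
def run_subagent(task: str) -> str:
    # Placeholder for a real LLM call: a fresh context window, tool use,
    # then a request for a condensed (~1,000-2,000 token) summary.
    return f"summary({task})"

def coordinate(plan: list[str]) -> str:
    """The lead agent holds the high-level plan; only the condensed
    summaries from each subagent enter its context."""
    summaries = [run_subagent(task) for task in plan]
    return "\n".join(summaries)
```

The detailed search-and-exploration work each subagent performs, which might consume tens of thousands of tokens, never touches the coordinator's attention budget.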

The choice between techniques depends on task characteristics. Compaction maintains conversational flow for extensive back-and-forth. Note-taking excels for iterative development with clear milestones. Multi-agent architectures handle complex research where parallel exploration pays dividends.

For developers, this means rethinking how you structure agent loops. You're not just calling an LLM API — you're managing a stateful system with memory constraints. You need strategies for what to keep, what to discard, and what to persist outside the context window.

Try It Yourself

Anthropic provides a memory and context management cookbook on the Claude Developer Platform with practical examples. The memory tool is in public beta, letting you store and consult information outside the context window through a file-based system.

If you're building agents today, start by auditing your context usage. How many tokens are you passing per turn? How much of that is actually relevant to the next decision? Can you replace upfront retrieval with just-in-time exploration?

The simplest experiment: implement tool result clearing. Once a tool call sits deep in the message history, clear its raw result and keep only what the agent explicitly noted as important. Measure the impact on both token usage and task success rate.
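
One minimal way to sketch that experiment, assuming a simple message-dict history where tool outputs carry the role "tool":

```python
def clear_old_tool_results(messages: list[dict], keep_recent: int = 3) -> list[dict]:
    """Replace raw tool outputs deep in history with a short stub, keeping
    the record that the call happened while dropping the payload."""
    cutoff = len(messages) - keep_recent
    cleared = []
    for i, m in enumerate(messages):
        if m.get("role") == "tool" and i < cutoff:
            cleared.append({**m, "content": "[tool result cleared]"})
        else:
            cleared.append(m)
    return cleared
```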

For longer-horizon tasks, try structured note-taking. Give your agent a NOTES.md file and a tool to read/write it. Prompt the agent to maintain a running log of progress, decisions, and unresolved issues. After 50+ turns, compact the message history but keep the notes file. See if the agent maintains coherence.

The Bottom Line

Use context engineering if you're building agents that loop, operate over multiple turns, or handle long-horizon tasks. Skip it if you're doing one-shot classification or simple RAG where you control exactly what goes into context.

The real risk is treating context windows as infinite just because they're large. A 200K token limit doesn't mean you should use 200K tokens. Context rot is real, and every token you add dilutes the model's attention. The opportunity is in building agents that intelligently manage their own information diet — exploring just-in-time, taking notes, and compacting history when needed.

As models improve, they'll require less prescriptive engineering. But the fundamental constraint remains: attention is finite. The teams that win will be the ones who treat context as a precious resource, not a dumping ground. Start by asking not "what can I add to context?" but "what's the minimum I need to achieve this outcome?"

That shift in thinking is what separates prompt engineering from context engineering. And it's what separates agents that work from agents that scale.

Source: Anthropic