How Anthropic Built Claude's Multi-Agent Research System

Anthropic's multi-agent research system uses Claude Opus 4 as orchestrator with Sonnet 4 subagents, achieving 90% better performance than single agents. Here's how they built it, the eight prompting principles that emerged, and why production was harder than the prototype.

TL;DR

  • Anthropic's multi-agent research system uses Claude Opus 4 as orchestrator with Sonnet 4 subagents, achieving 90% better performance than single-agent systems
  • Token usage explains 80% of performance variance—multi-agent systems burn 15× more tokens than chat but solve problems single agents can't
  • Eight prompting principles emerged: think like your agents, teach delegation, scale effort to complexity, design tools carefully, let agents improve themselves, start wide then narrow, guide thinking, and parallelize tool calls
  • Production reliability requires stateful error handling, full tracing, rainbow deployments, and treating agents as long-running processes, not functions

The Big Picture

Claude's new Research feature isn't just another RAG wrapper. It's a production multi-agent system where a lead agent spawns specialized subagents that search in parallel, compress findings, and coordinate results. Anthropic shipped it after learning hard lessons about agent coordination, prompt engineering at scale, and the gap between prototype and production.

The core insight: research is inherently unpredictable. You can't hardcode a path for exploring complex topics. When humans research, they pivot based on discoveries, follow tangential leads, and continuously update their approach. Single-agent systems fail here because they're sequential. Multi-agent systems win because they parallelize exploration across separate context windows, each with distinct tools and prompts.

The results validate the architecture. On Anthropic's internal research eval, the multi-agent system outperformed single-agent Claude Opus 4 by 90.2%. On BrowseComp—a benchmark testing agents' ability to locate hard-to-find information—three factors explained 95% of performance variance: token usage (80%), number of tool calls, and model choice. Multi-agent systems scale token usage by distributing work across agents with separate context windows. The tradeoff: these systems burn tokens fast. Agents use 4× more tokens than chat, multi-agent systems use 15× more. This only makes economic sense for high-value tasks.

Anthropic's post breaks down the architecture, prompting principles, evaluation strategy, and production engineering challenges. It's one of the most detailed public accounts of shipping a multi-agent system at scale.

How It Works

The architecture follows an orchestrator-worker pattern. A lead agent (Claude Opus 4) analyzes the user query, develops a strategy, and spawns subagents (Claude Sonnet 4) to explore different aspects simultaneously. Each subagent acts as an intelligent filter: it iteratively uses search tools, evaluates results with interleaved thinking, and returns compressed findings to the lead agent.

This differs fundamentally from traditional RAG. Static retrieval fetches chunks similar to a query and generates a response. Anthropic's system uses multi-step search that dynamically finds information, adapts to findings, and analyzes results. The lead agent enters an iterative research loop: it thinks through the approach, saves the plan to memory (critical when context exceeds 200K tokens), creates specialized subagents with specific tasks, synthesizes their findings, and decides whether more research is needed.
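That iterative loop can be sketched as follows; the `Memory` class, the `spawn_subagent` stub, and the stopping check are illustrative stand-ins, since Anthropic has not published the production code:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """External memory: the plan survives even if the context window is truncated."""
    plan: str = ""
    findings: list = field(default_factory=list)

def spawn_subagent(task: str) -> str:
    # Stand-in for a Sonnet-class subagent: search, think, compress, return.
    return f"compressed findings for: {task}"

def research(query: str, max_rounds: int = 3) -> list:
    memory = Memory(plan=f"strategy for: {query}")  # plan saved before spawning
    for _ in range(max_rounds):
        tasks = [f"{query} / aspect {i}" for i in range(2)]  # lead agent decomposes
        memory.findings += [spawn_subagent(t) for t in tasks]
        if len(memory.findings) >= 4:  # stand-in for "is more research needed?"
            break
    return memory.findings

results = research("semiconductor shortage")
```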

Once sufficient information is gathered, the system exits the loop and passes findings to a CitationAgent, which processes documents and identifies specific locations for citations. This ensures all claims are attributed to sources. The final research results, complete with citations, return to the user.

The coordination complexity grows rapidly. Early agents spawned 50 subagents for simple queries, scoured the web endlessly for nonexistent sources, and distracted each other with excessive updates. Since each agent is steered by a prompt, prompt engineering became the primary lever for improvement.

Eight Prompting Principles

1. Think like your agents. Anthropic built simulations in Console with exact prompts and tools from production, then watched agents work step-by-step. This revealed failure modes: agents continuing when they had sufficient results, using verbose search queries, selecting incorrect tools. Effective prompting requires an accurate mental model of the agent.

2. Teach the orchestrator how to delegate. Each subagent needs an objective, output format, tool guidance, and clear task boundaries. Without detailed task descriptions, agents duplicate work or leave gaps. Early versions allowed simple instructions like "research the semiconductor shortage," but subagents misinterpreted tasks or performed identical searches. One explored the 2021 automotive chip crisis while two others duplicated work on 2025 supply chains.
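Those four elements can be captured in a simple task spec; the `SubagentTask` structure and field names here are hypothetical, not Anthropic's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SubagentTask:
    """The four things the lead agent spells out for every subagent."""
    objective: str      # what to find out
    output_format: str  # how to report back
    tool_guidance: str  # which tools and sources to favor
    boundaries: str     # what NOT to cover, so subagents don't duplicate work

task = SubagentTask(
    objective="Map the 2021 automotive chip crisis and its causes",
    output_format="Bullet list of findings, each with a source URL",
    tool_guidance="Prefer web search; skip paywalled sources",
    boundaries="Do not cover 2025 supply chains (assigned to another subagent)",
)
```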

3. Scale effort to query complexity. Agents struggle to judge appropriate effort, so Anthropic embedded scaling rules in prompts. Simple fact-finding requires 1 agent with 3-10 tool calls. Direct comparisons need 2-4 subagents with 10-15 calls each. Complex research uses 10+ subagents with divided responsibilities. These guidelines prevent overinvestment in simple queries.
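As a sketch, those scaling rules could be expressed as data; the table mirrors the guidelines above (the 15-25 call range for complex research is an assumed upper bound), and in production the thresholds live in the prompt itself rather than in code:

```python
# The effort-scaling guidelines, expressed as data.
EFFORT_RULES = {
    "simple":  {"subagents": 1,  "tool_calls_each": (3, 10)},
    "compare": {"subagents": 3,  "tool_calls_each": (10, 15)},  # 2-4 subagents in the text
    "complex": {"subagents": 10, "tool_calls_each": (15, 25)},  # upper bound assumed
}

def plan_effort(complexity: str) -> dict:
    """Turn a complexity judgment into a budget the lead agent hands to subagents."""
    rule = EFFORT_RULES[complexity]
    return {"subagents": rule["subagents"],
            "budget_per_agent": rule["tool_calls_each"][1]}
```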

4. Tool design and selection are critical. Agent-tool interfaces are as critical as human-computer interfaces, and using the right tool is often a hard requirement: an agent searching the web for context that only exists in Slack is doomed from the start. With MCP servers giving models access to external tools, the problem compounds, because agents encounter unfamiliar tools with descriptions of varying quality. Anthropic gave agents explicit heuristics: examine all available tools first, match tool usage to user intent, search the web for broad exploration, prefer specialized tools over generic ones. Bad tool descriptions send agents down wrong paths. Writing effective tool descriptions became critical enough that Anthropic created a tool-testing agent to rewrite flawed MCP tool descriptions, resulting in a 40% decrease in task completion time.

5. Let agents improve themselves. Claude 4 models are excellent prompt engineers. When given a prompt and failure mode, they diagnose why the agent is failing and suggest improvements. The tool-testing agent attempts to use flawed tools, then rewrites descriptions to avoid failures. By testing tools dozens of times, it found key nuances and bugs.
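That test-and-rewrite loop might look like the sketch below, where `run_tool_with_description` and `rewrite_description` are stand-ins for model calls (here the "tool" succeeds only once its output format is documented):

```python
def run_tool_with_description(desc: str) -> bool:
    # Stand-in: the testing agent tries the tool; in this toy setup it succeeds
    # only when the description documents the output format.
    return "returns JSON" in desc

def rewrite_description(desc: str) -> str:
    # Stand-in for Claude rewriting a flawed description after watching failures.
    return desc + " Input: a query string. Output: returns JSON search results."

def improve_tool_description(desc: str, trials: int = 20) -> str:
    for _ in range(3):  # a few rewrite rounds
        failures = sum(not run_tool_with_description(desc) for _ in range(trials))
        if failures == 0:
            return desc  # description now reliable across repeated trials
        desc = rewrite_description(desc)
    return desc

final = improve_tool_description("Searches the company wiki.")
```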

6. Start wide, then narrow down. Search strategy should mirror expert human research: explore the landscape before drilling into specifics. Agents default to overly long, specific queries that return few results. Anthropic counteracted this by prompting agents to start with short, broad queries, evaluate what's available, then progressively narrow focus.

7. Guide the thinking process. Extended thinking mode serves as a controllable scratchpad. The lead agent uses thinking to plan its approach, assess which tools fit the task, determine query complexity and subagent count, and define each subagent's role. Testing showed extended thinking improved instruction-following, reasoning, and efficiency. Subagents use interleaved thinking after tool results to evaluate quality, identify gaps, and refine their next query.

8. Parallel tool calling transforms speed and performance. Early agents executed sequential searches, which was painfully slow. Anthropic introduced two kinds of parallelization: the lead agent spins up 3-5 subagents in parallel, and subagents use 3+ tools in parallel. These changes cut research time by up to 90% for complex queries.
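Both levels of parallelism can be sketched with `asyncio.gather`; the `search` coroutine is a stand-in for a real network-bound tool call:

```python
import asyncio

async def search(query: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a network-bound tool call
    return f"results for {query}"

async def subagent(topic: str) -> str:
    # Each subagent fires 3+ tool calls concurrently instead of one at a time.
    hits = await asyncio.gather(*(search(f"{topic} q{i}") for i in range(3)))
    return "; ".join(hits)

async def lead_agent(query: str) -> list:
    # The lead agent likewise spins up subagents in parallel (3-5 in production).
    topics = [f"{query} aspect {i}" for i in range(4)]
    return await asyncio.gather(*(subagent(t) for t in topics))

findings = asyncio.run(lead_agent("chip supply chains"))
```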

The prompting strategy focuses on instilling good heuristics rather than rigid rules. Anthropic studied how skilled humans approach research and encoded these strategies: decompose difficult questions, evaluate source quality, adjust search approaches based on new information, recognize when to focus on depth versus breadth. They also set explicit guardrails to prevent agents from spiraling out of control.

What This Changes For Developers

Multi-agent systems introduce new challenges in coordination, evaluation, and reliability. Traditional software assumes deterministic paths: given input X, follow path Y to produce output Z. Multi-agent systems don't work this way. Even with identical starting points, agents take completely different valid paths to reach their goal.

Anthropic's evaluation strategy evolved to handle this. They started with 20 queries representing real usage patterns. With effect sizes this large early on—a prompt tweak might boost success rates from 30% to 80%—small test sets were sufficient. The lesson: start evaluating immediately with small samples rather than delaying until you can build thorough evals.

For scaled evaluation, Anthropic used LLM-as-judge with a rubric covering factual accuracy, citation accuracy, completeness, source quality, and tool efficiency. A single LLM call outputting scores from 0.0-1.0 and a pass-fail grade proved most consistent and aligned with human judgments. This worked especially well when test cases had clear answers.
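A minimal sketch of such a judge, with a canned response standing in for the model call; the rubric criteria come from the text, while the JSON response shape is an assumption:

```python
import json

RUBRIC = ["factual_accuracy", "citation_accuracy", "completeness",
          "source_quality", "tool_efficiency"]

def judge(report: str, llm=None) -> dict:
    """Single judge call: per-criterion scores in 0.0-1.0 plus a pass/fail grade."""
    prompt = (
        "Grade this research report on each criterion from 0.0 to 1.0, "
        f"then give an overall pass/fail grade. Criteria: {', '.join(RUBRIC)}.\n"
        f"Report:\n{report}\nAnswer as JSON."
    )
    # Canned response for the sketch; a real system would call the model here.
    raw = llm(prompt) if llm else json.dumps(
        {c: 0.9 for c in RUBRIC} | {"grade": "pass"})
    scores = json.loads(raw)
    assert all(0.0 <= scores[c] <= 1.0 for c in RUBRIC)
    return scores

scores = judge("Claude Opus 4 orchestrates Sonnet 4 subagents.")
```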

Human evaluation caught what automation missed: hallucinated answers on unusual queries, system failures, subtle source selection biases. Human testers noticed early agents consistently chose SEO-optimized content farms over authoritative but lower-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to prompts helped resolve this.

Production reliability required new approaches. Agents are stateful and run for long periods, maintaining state across many tool calls. Minor system failures can be catastrophic. Anthropic built systems that resume from where errors occurred rather than restarting from the beginning. They combine AI adaptability—letting the agent know when a tool is failing so it can adapt—with deterministic safeguards like retry logic and regular checkpoints.
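The resume-and-retry pattern might be sketched like this, with a hypothetical `FlakyTool` simulating a transient outage:

```python
class FlakyTool:
    """Fails on its first call, then recovers (simulates a transient outage)."""
    def __init__(self):
        self.calls = 0

    def __call__(self, step: str) -> str:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("transient failure")
        return f"ok:{step}"

def run_with_checkpoints(steps, tool, checkpoint=None, max_retries=3):
    # Resume from the last checkpoint rather than restarting from the beginning.
    state = checkpoint or {"done": {}}
    for step in steps:
        if step in state["done"]:
            continue  # completed before the failure; skip on resume
        for attempt in range(max_retries):
            try:
                state["done"][step] = tool(step)
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise  # surface the failure so the agent can adapt
    return state

state = run_with_checkpoints(["search", "read", "summarize"], FlakyTool())
```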

Debugging benefits from full production tracing. Agents make dynamic decisions and are non-deterministic between runs. When users reported agents "not finding obvious information," Anthropic couldn't see why without tracing. Were agents using bad search queries? Choosing poor sources? Hitting tool failures? Full tracing let them diagnose root causes and fix issues systematically. They monitor agent decision patterns and interaction structures without monitoring individual conversation contents, maintaining user privacy.

Deployment needs careful coordination. Agent systems are stateful webs of prompts, tools, and execution logic that run almost continuously. Updates can't break existing agents mid-process. Anthropic uses rainbow deployments to gradually shift traffic from old to new versions while keeping both running simultaneously.

Try It Yourself

Anthropic published example prompts from the Research system in their open-source Cookbook. The patterns demonstrate orchestrator-worker architectures, delegation strategies, and tool coordination.

For developers building multi-agent systems, Anthropic's appendix offers additional tips:

End-state evaluation for agents that mutate state. Focus on whether the agent achieved the correct final state rather than judging turn-by-turn process. For complex workflows, break evaluation into discrete checkpoints where specific state changes should have occurred.
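A sketch of checkpoint-based end-state evaluation; the task and state fields are invented for illustration:

```python
def eval_end_state(final_state: dict, checkpoints: list) -> dict:
    """Judge whether required state changes happened, not the path taken."""
    return {name: check(final_state) for name, check in checkpoints}

# Hypothetical task: the agent was asked to file and tag a report, then clean up.
final_state = {"report_saved": True, "tags": ["urgent", "q3"], "draft_deleted": True}
results = eval_end_state(final_state, [
    ("report exists", lambda s: s["report_saved"]),
    ("tagged urgent", lambda s: "urgent" in s["tags"]),
    ("draft cleaned up", lambda s: s["draft_deleted"]),
])
```

Whatever route the agent took through its tools, the evaluation only asks whether each required state change occurred.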

Long-horizon conversation management. Production agents engage in conversations spanning hundreds of turns. Anthropic implemented patterns where agents summarize completed work phases and store essential information in external memory before proceeding. When context limits approach, agents spawn fresh subagents with clean contexts while maintaining continuity through careful handoffs. Context engineering becomes critical for managing these extended interactions.
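A sketch of that handoff pattern, with a word count standing in for real token accounting and a truncated string standing in for a model-written summary:

```python
CONTEXT_LIMIT = 200  # words here; the real system tracks a ~200K-token window

def handoff_if_needed(context: list, memory: list) -> list:
    """Summarize the finished phase into memory, then hand off to a clean context."""
    if sum(len(turn.split()) for turn in context) < CONTEXT_LIMIT:
        return context  # still room; keep going in the same context
    memory.append("summary: " + context[0][:40])  # stand-in for a model summary
    return [f"continue from memory item {len(memory) - 1}"]  # fresh subagent context

memory: list = []
fresh = handoff_if_needed(["finding " * 250], memory)
```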

Subagent output to a filesystem. Direct subagent outputs can bypass the main coordinator for certain results, improving fidelity and performance. Rather than requiring subagents to communicate everything through the lead agent, implement artifact systems where specialized agents create outputs that persist independently. Subagents call tools to store their work in external systems, then pass lightweight references back to the coordinator.
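A sketch of the artifact pattern, using the local filesystem as the external store; names and paths are illustrative:

```python
import pathlib
import tempfile

def subagent_emit(workdir: pathlib.Path, name: str, artifact: str) -> str:
    """Persist the full output; hand the coordinator only a lightweight reference."""
    path = workdir / f"{name}.md"
    path.write_text(artifact)
    return str(path)  # reference passed back instead of the full content

workdir = pathlib.Path(tempfile.mkdtemp())
ref = subagent_emit(workdir, "chip-supply", "## Findings\n" + "detail " * 1000)
full = pathlib.Path(ref).read_text()  # coordinator dereferences only when needed
```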

The Bottom Line

Use multi-agent systems if you're building research tools, complex information synthesis, or tasks requiring parallel exploration across multiple sources. The 90% performance gain over single-agent systems justifies the 15× token cost for high-value work. Skip it if your tasks are sequential, require shared context across all agents, or involve tight real-time coordination—most coding tasks fall into this category.

The real risk is underestimating the production gap. Anthropic's post makes clear that the last mile becomes most of the journey. An agent system that works on a developer's machine still requires significant engineering to become a reliable production system. Because errors compound, minor issues can derail agents entirely: one failing step sends an agent down an entirely different trajectory.

The real opportunity is that multi-agent systems are already transforming how people solve complex problems. Users report Claude helped them find business opportunities they hadn't considered, navigate complex healthcare options, resolve technical bugs, and save days of work by uncovering research connections they wouldn't have found alone. The top use cases: developing software systems across specialized domains (10%), developing and optimizing professional content (8%), developing business growth strategies (8%), assisting with academic research (7%), and researching and verifying information about people, places, or organizations (5%).

Multi-agent research systems can operate reliably at scale with careful engineering, comprehensive testing, detail-oriented prompt and tool design, robust operational practices, and tight collaboration between teams who understand current agent capabilities. The architecture works. The challenge is shipping it.

Source: Anthropic