Building Effective AI Agents: Anthropic's Production Patterns
Anthropic's internal playbook for building production AI agents: five workflow patterns, when to use autonomous agents vs predefined workflows, and why tool design matters as much as prompts.
TL;DR
- Anthropic worked with dozens of teams and found the best agent implementations use simple, composable patterns—not complex frameworks
- Workflows (predefined paths) vs agents (dynamic control): choose based on whether you need predictability or flexibility
- Five core workflow patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer
- Tool design matters as much as prompts—invest in your agent-computer interface (ACI) like you would HCI
The Big Picture
Most teams building AI agents are doing it wrong. They reach for complex frameworks, add layers of abstraction, and build autonomous systems when a simple prompt chain would work better.
Anthropic just published their internal playbook for building production agents, based on working with dozens of customer implementations across industries. The standout insight: the most successful teams weren't using specialized libraries or complex orchestration frameworks. They were building with basic patterns, optimizing relentlessly, and adding complexity only when simpler approaches failed.
This isn't a theoretical framework. It's a distillation of what actually works in production—from customer support agents handling real tickets to coding agents solving SWE-bench tasks. The guide breaks down five workflow patterns and explains when to use autonomous agents versus predefined workflows.
The core tension: agentic systems trade latency and cost for better task performance. Most developers add that complexity too early. Anthropic's recommendation is blunt: start with optimized single LLM calls with retrieval and in-context examples. Only move to multi-step workflows when you can measure the improvement.
How It Works
Anthropic draws a critical architectural distinction that most teams miss. "Agentic systems" is the umbrella term, but underneath there are two fundamentally different approaches:
Workflows orchestrate LLMs and tools through predefined code paths. You control the sequence. The LLM executes each step, but you've hardcoded the structure.
Agents let the LLM dynamically direct its own process and tool usage. The model maintains control over how it accomplishes tasks. This autonomy is powerful but expensive and error-prone.
The building block for both is the augmented LLM: a model enhanced with retrieval, tools, and memory. Claude can generate its own search queries, select appropriate tools, and determine what information to retain. Anthropic recently released the Model Context Protocol to standardize how these augmentations integrate.
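The augmented LLM can be sketched in a few lines. This is a minimal, illustrative composition, assuming a hypothetical `call_llm` function standing in for your provider's API (e.g. the Anthropic Messages API); the `retrieve` and `memory` hooks are placeholders, not Anthropic's implementation:

```python
# Augmented LLM sketch: one model call composed with retrieval,
# tool schemas, and memory. `call_llm` is a hypothetical stand-in
# for a real API call.

def augmented_llm(prompt, call_llm, retrieve=None, tools=None, memory=None):
    """Assemble memory and retrieved context around the user prompt."""
    context = []
    if memory:
        context.append("Memory:\n" + "\n".join(memory))
    if retrieve:
        context.append("Retrieved:\n" + "\n".join(retrieve(prompt)))
    full_prompt = "\n\n".join(context + [prompt])
    return call_llm(full_prompt, tools=tools or [])
```

The point is that augmentation is composition around a single call, not a framework.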
Five Workflow Patterns
1. Prompt chaining decomposes tasks into sequential steps. Each LLM call processes the previous output. Add programmatic checks between steps to verify you're on track. Use this when tasks decompose cleanly into fixed subtasks—like generating marketing copy, then translating it.
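The copy-then-translate example might look like this sketch, with a programmatic gate between the two calls (`call_llm` is a hypothetical stand-in for a real API call):

```python
# Prompt chaining sketch: each call consumes the previous output,
# with a cheap programmatic check ("gate") in between.

def generate_and_translate(product, call_llm):
    copy = call_llm(f"Write one paragraph of marketing copy for: {product}")
    # Gate: verify the intermediate output before paying for step two.
    if len(copy.split()) < 5:
        raise ValueError("copy looks too short; retry or escalate")
    return call_llm(f"Translate to French, preserving tone:\n{copy}")
```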
2. Routing classifies input and directs it to specialized follow-up tasks. This prevents the common problem where optimizing for one input type degrades performance on others. Route different kinds of customer service queries into different downstream processes. Route simple questions to Haiku and complex ones to Sonnet.
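Routing is a classify-then-dispatch step. In this sketch, `classify`, `cheap_model`, and `strong_model` are hypothetical callables (the first could itself be a small LLM call):

```python
# Routing sketch: classify first, then dispatch to a model sized
# for the query. All three callables are illustrative stand-ins.

def route(query, classify, cheap_model, strong_model):
    label = classify(query)        # e.g. a small LLM call or heuristic
    if label == "simple":
        return cheap_model(query)  # e.g. Haiku
    return strong_model(query)     # e.g. Sonnet
```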
3. Parallelization splits work across simultaneous LLM calls. Two variations: sectioning (independent subtasks) and voting (same task, multiple attempts). Use sectioning for guardrails—one instance handles the user query while another screens for inappropriate content. Use voting for code vulnerability reviews where multiple prompts evaluate different aspects.
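The voting variation can be sketched with standard-library concurrency. `call_llm` is a hypothetical stand-in for a real API call; taking the majority answer assumes responses are comparable strings:

```python
# Parallelization (voting) sketch: run the same task under several
# prompts concurrently and keep the majority answer.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def vote(task, call_llm, prompts):
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda p: call_llm(f"{p}\n\n{task}"), prompts))
    return Counter(answers).most_common(1)[0][0]
```

Sectioning is the same shape, but each worker gets a different subtask instead of a different prompt for the same task.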
4. Orchestrator-workers uses a central LLM to dynamically break down tasks and delegate to worker LLMs. The key difference from parallelization: subtasks aren't predefined. The orchestrator determines them based on input. This works for coding products that make complex changes across multiple files, where you can't predict which files need editing upfront.
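A minimal sketch of the orchestrator-workers shape, assuming a hypothetical `call_llm` stand-in; the "files to change" framing mirrors the coding example, but the plan format here is illustrative:

```python
# Orchestrator-workers sketch: the orchestrator decides the subtasks
# at runtime (here, which files to edit), then workers execute them
# and a final call synthesizes the results.

def orchestrate(task, call_llm):
    plan = call_llm(f"List the files to change for: {task} (one per line)")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    results = [call_llm(f"Edit {path} to accomplish: {task}") for path in subtasks]
    return call_llm("Summarize these edits:\n" + "\n".join(results))
```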
5. Evaluator-optimizer runs a loop where one LLM generates responses and another provides evaluation and feedback. Use this when you have clear evaluation criteria and iterative refinement provides measurable value. Literary translation is the canonical example—an evaluator can catch nuances the translator missed.
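The loop itself is small. In this sketch, `generate` and `evaluate` are hypothetical LLM-backed callables, and the `"ACCEPT"` sentinel is an illustrative protocol, not a real API convention:

```python
# Evaluator-optimizer sketch: one model drafts, another critiques,
# and the loop stops on acceptance or a round cap.

def refine(task, generate, evaluate, max_rounds=3):
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        feedback = evaluate(task, draft)
        if feedback == "ACCEPT":
            return draft
        draft = generate(task, feedback=feedback)
    return draft  # best effort after the round cap
```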
When to Use Autonomous Agents
Agents are for open-ended problems where you can't predict the required steps or hardcode a fixed path. The LLM operates for many turns with minimal human intervention. This requires trust in the model's decision-making.
Anthropic's own implementations: a coding agent that resolves SWE-bench tasks by editing multiple files based on task descriptions, and their computer use reference implementation where Claude controls a desktop environment.
The autonomous nature means higher costs and compounding errors. Anthropic recommends extensive testing in sandboxed environments with appropriate guardrails. Agents work best for tasks that require conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.
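The structural difference from workflows is that the loop has no predefined path: the model chooses each action. A minimal sketch, where `call_llm` is a hypothetical stand-in for a real API call and the `"NAME ARG"` / `"DONE: ..."` action protocol is illustrative only:

```python
# Autonomous agent loop sketch: the model picks the next action each
# turn; a turn cap and a restricted tool table act as guardrails.

def agent_loop(goal, call_llm, tools, max_turns=10):
    history = [f"Goal: {goal}"]
    for _ in range(max_turns):          # guardrail: bound cost and errors
        action = call_llm("\n".join(history))
        if action.startswith("DONE:"):
            return action[len("DONE:"):].strip()
        name, _, arg = action.partition(" ")
        observation = tools[name](arg)  # only whitelisted tools can run
        history.append(f"Action: {action}\nObservation: {observation}")
    raise RuntimeError("turn limit reached without completion")
```

The turn cap and the explicit `tools` table are the code-level expression of the guardrails Anthropic recommends.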
What This Changes For Developers
The immediate impact: you probably don't need that agent framework you were evaluating.
Anthropic explicitly calls out the tradeoff with frameworks. They simplify standard tasks like calling LLMs, defining tools, and chaining calls. But they add abstraction layers that obscure prompts and responses, making debugging harder. Worse, they make it tempting to add complexity when simpler setups would work.
The recommendation: start with LLM APIs directly. Many patterns take a few lines of code. If you use a framework, understand the underlying implementation. Incorrect assumptions about what's under the hood are a common source of errors.
The bigger shift is around tool design. Anthropic spent more time optimizing tools than prompts when building their SWE-bench agent. They introduce the concept of agent-computer interfaces (ACI)—the agent equivalent of human-computer interfaces.
Tool design principles:
- Give the model enough tokens to "think" before it writes itself into a corner
- Keep formats close to what the model has seen naturally on the internet
- Eliminate formatting overhead, like requiring the model to keep accurate line counts or string-escape the code it writes
- Write tool descriptions like docstrings for a junior developer—include examples, edge cases, input requirements
- Poka-yoke (mistake-proof) your tools: change argument designs so mistakes are harder to make
Concrete example: Anthropic's agent made mistakes with relative filepaths after moving out of the root directory. They changed the tool to require absolute filepaths. The model used it flawlessly afterward.
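That fix can be expressed directly in the tool's contract. `read_file` here is an illustrative tool, not Anthropic's actual implementation; the point is that the signature rejects the error class instead of trusting the model to track its working directory:

```python
import os

# Poka-yoke sketch of the filepath fix: the tool refuses relative
# paths outright, so the whole class of "wrong working directory"
# mistakes becomes impossible rather than merely discouraged.

def read_file(path: str) -> str:
    if not os.path.isabs(path):
        raise ValueError(f"path must be absolute, got {path!r}")
    with open(path) as f:
        return f.read()
```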
For customer support implementations, the pattern is clear: conversation flow plus tool integration for pulling customer data, order history, and knowledge base articles. Actions like refunds or ticket updates are handled programmatically, and success is measured through resolution rates. Several companies now use usage-based pricing that charges only for successful resolutions.
For coding agents, the advantage is verifiable output through automated tests. Agents iterate using test results as feedback. The problem space is well-defined. But human review remains crucial for ensuring solutions align with broader system requirements.
Try It Yourself
Anthropic published a cookbook with sample implementations of these patterns. Start there rather than with a framework.
For tool integration, the Model Context Protocol provides a standardized approach. The client implementation tutorial shows how to integrate with third-party tools.
If you're building coding agents specifically, study Anthropic's SWE-bench implementation. The agent architecture and tool design decisions are documented in their research posts.
Three core principles for agent implementation:
- Maintain simplicity in design
- Prioritize transparency by explicitly showing planning steps
- Carefully craft your ACI through thorough tool documentation and testing
Test workflow: Run many example inputs in the Anthropic workbench to see what mistakes the model makes with your tools. Iterate on tool definitions based on observed errors.
The Bottom Line
Use workflows when you need predictability and can define the task structure upfront. Use agents when the problem is genuinely open-ended and you can't predict the required steps. Most teams should start with workflows.
Skip agents entirely if you're building: content generation with fixed templates, classification tasks, simple Q&A over documents, or anything where a single optimized LLM call with good retrieval works. The latency and cost aren't worth it.
Build agents if you're tackling: complex customer support requiring multi-step actions across systems, coding tasks involving unpredictable file changes, research tasks requiring iterative information gathering, or any scenario where the model needs to make judgment calls about next steps.
The real risk isn't building agents that are too simple—it's building agents that are too complex. Every abstraction layer is a debugging liability. Every autonomous decision is a potential compounding error. Start simple, measure obsessively, add complexity only when you can prove it improves outcomes. That's the pattern that works in production.
Source: Anthropic