How Anthropic Actually Builds Evals for AI Agents That Ship

Anthropic's playbook for building AI agent evaluations that actually work. Start with 20-50 real failures, combine deterministic and model-based graders, and read the transcripts. The teams that invest early ship faster.


TL;DR

  • Evals are the difference between shipping confidently and flying blind—teams without them get stuck fixing one bug while creating three others
  • Agent evals combine three grader types: deterministic (unit tests), model-based (LLM rubrics), and human review—each catches what the others miss
  • Start with 20-50 real failure cases, not hundreds of synthetic tasks. Build early or pay the price later when you're reverse-engineering success criteria from production
  • This matters if you're building any agent that ships to users. Anthropic's playbook works across coding agents, research agents, conversational agents, and computer use agents

The Big Picture

Most teams building AI agents hit the same wall. Early prototypes feel magical. Manual testing works fine. Then you ship to users, and suddenly every fix breaks something else. You're playing whack-a-mole with regressions, and you have no systematic way to know if the agent got better or worse.

Anthropic's solution: build evaluations early, treat them like production code, and use them to drive every decision. Not as an afterthought. Not when things break. From day one.

The company has shipped Claude Code, multi-agent systems that code for hours, and models that crack their own benchmarks. Their eval infrastructure is why they can upgrade models in days while competitors spend weeks testing. This isn't theory—it's the actual playbook from teams shipping agents at scale.

Here's what makes agent evals harder than traditional software testing: agents operate over many turns, call tools autonomously, and modify state as they go. Mistakes compound. A coding agent that misreads a file in turn 3 might spend 20 turns debugging phantom issues. Traditional unit tests don't capture this.

The breakthrough insight: evals aren't just regression tests. They're how you define what "good" means before you build it, how you communicate between product and research teams, and how you know whether a new model is worth upgrading to.

How It Works

An evaluation is a test for an AI system. Give it an input, apply grading logic to the output, measure success. Simple for single-turn prompts. Complex for agents that take hundreds of actions across dozens of turns.

Anthropic breaks agent evals into clear components. A task is a single test with defined inputs and success criteria. A trial is one attempt at that task—you run multiple trials because model outputs vary. A grader scores some aspect of performance. A transcript is the complete record of what happened: every tool call, every reasoning step, every API interaction.

The outcome is what actually changed in the environment. A support agent might say "ticket resolved" in the transcript, but the outcome is whether the database shows a closed ticket. This distinction matters. Agents can lie or hallucinate. State doesn't.
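This vocabulary can be sketched as a small data model. The names below are illustrative (the source doesn't prescribe an API), but they show how outcome and transcript stay separate:

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical names for illustration; Anthropic's internal schema is not public.

@dataclass
class Task:
    task_id: str
    prompt: str            # instructions given to the agent
    success_criteria: str  # what "done" means, unambiguous to two experts

@dataclass
class Transcript:
    events: list  # every tool call, reasoning step, and API interaction

@dataclass
class Trial:
    task: Task
    transcript: Transcript  # what the agent SAID it did
    outcome: dict           # what actually changed in the environment
    scores: dict = field(default_factory=dict)  # grader name -> score

# A grader scores one aspect of a single trial's performance.
Grader = Callable[[Trial], float]
```

Keeping `outcome` as observed environment state, rather than parsing claims out of the transcript, is what lets graders catch an agent that reports success it didn't achieve.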

Graders come in three types, and choosing the right mix is critical:

Deterministic graders are code-based checks. Does the code compile? Do the unit tests pass? Is the file in the right location? These are fast, reliable, and should be your first choice when they work. SWE-bench Verified uses this approach: give the agent a GitHub issue, run the test suite, pass only if it fixes the bug without breaking existing tests.
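A deterministic grader is ordinary code. A minimal sketch in that spirit; the `output/solution.py` location is an assumed layout, and the test command is whatever your project uses:

```python
import subprocess
from pathlib import Path

def grade_coding_task(workdir: str, test_cmd: list[str]) -> bool:
    """Deterministic grader: the required artifact exists and the tests pass.

    The `output/solution.py` path is an illustrative assumption, not a
    prescribed layout; `test_cmd` is the project's own test invocation.
    """
    if not Path(workdir, "output", "solution.py").exists():
        return False
    # Exit code 0 means every test passed and nothing regressed.
    result = subprocess.run(test_cmd, cwd=workdir, capture_output=True)
    return result.returncode == 0
```

Because the check is pure code, it's fast, cheap to run on every trial, and never hallucinates a grade.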

Model-based graders use LLMs to evaluate outputs against rubrics. Did the agent show empathy? Is the code well-structured? Are claims supported by sources? These handle subjective judgments that deterministic checks can't capture. The catch: they need careful calibration against human judgment to avoid hallucinated grades.
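A model-based grader is a rubric plus an LLM call. The sketch below is deliberately provider-agnostic: `judge` is any function mapping a prompt string to a completion string, and the rubric wording is a made-up example:

```python
RUBRIC = """Score the agent's reply for empathy on a 1-5 scale.
5 = acknowledges the user's frustration and offers concrete next steps.
1 = dismissive or ignores the user's stated problem.
Respond with only the integer score."""

def grade_empathy(reply: str, judge) -> int:
    """Model-based grader. `judge` is any LLM call: prompt str -> completion str.

    The rubric text is an illustrative assumption, not a published prompt.
    """
    raw = judge(f"{RUBRIC}\n\nAgent reply:\n{reply}")
    score = int(raw.strip())
    if not 1 <= score <= 5:
        # Guard against hallucinated or out-of-range grades.
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score
```

The out-of-range guard is the cheap half of calibration; the expensive half is periodically comparing `judge`'s scores against human graders on the same replies.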

Human graders are the gold standard for ambiguous or high-stakes tasks. Expensive and slow, but essential for calibrating the other two types. Use them periodically, not continuously.

The evaluation harness runs everything end-to-end: provides instructions and tools, executes tasks concurrently, records transcripts, applies graders, aggregates results. It needs to be robust—shared state between trials, resource exhaustion, or infrastructure flakiness will make results unreliable.
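A harness can start far smaller than that description suggests. A minimal sketch, with the caveat that `run_agent` must build a fresh environment per call (all names here are assumptions, not Anthropic's harness API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_suite(tasks, run_agent, graders, trials_per_task=4, workers=8):
    """Minimal eval harness: run each task several times concurrently,
    grade every trial, and aggregate per-task pass rates.

    `run_agent(task)` must set up a FRESH environment on every call;
    shared state between trials is exactly what makes results unreliable.
    """
    def one_trial(task):
        output = run_agent(task)                      # also records the transcript
        return all(g(task, output) for g in graders)  # pass = every grader agrees

    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for task in tasks:
            futures = [pool.submit(one_trial, task) for _ in range(trials_per_task)]
            passes = [f.result() for f in futures]
            results[task] = sum(passes) / trials_per_task
    return results
```

Running multiple trials per task is what makes the per-task number a rate rather than a coin flip.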

Two metrics capture non-determinism. pass@k measures the likelihood of at least one success in k attempts. If your agent solves a coding problem 50% of the time, pass@10 is much higher than pass@1—more shots on goal. pass^k measures the probability that all k trials succeed. Same 50% success rate, but pass^10 is near zero. Use pass@k when one good solution matters. Use pass^k when users expect consistency every time.
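Under an independence assumption with per-trial success probability p, both metrics reduce to one line each, and the 50% example works out exactly as described:

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent trials succeeds) -- written pass@k."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """P(all k independent trials succeed) -- written pass^k."""
    return p ** k

# With a 50% per-trial success rate the two metrics diverge sharply:
# pass_at_k(0.5, 10)  -> ~0.999  (ten shots on goal; one needs to land)
# pass_hat_k(0.5, 10) -> ~0.001  (users who need it to work every time)
```

In practice you estimate p from repeated trials rather than knowing it, but the asymmetry is the point: the same agent looks near-perfect on one metric and near-useless on the other.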

Different agent types need different eval strategies. Coding agents get deterministic graders—unit tests, static analysis, security scans. Conversational agents need model-based rubrics for interaction quality plus state checks for outcomes. Research agents combine groundedness checks (are claims supported?), coverage checks (did it find key facts?), and source quality checks (are sources authoritative?). Computer use agents require sandboxed environments where you can verify file system state, database contents, and UI element properties after task completion.

Anthropic learned this the hard way with Claude.ai's web search feature. Early evals only tested whether the model searched when it should. Result: the model searched for everything. They rebuilt the eval suite to include both directions—queries where search is appropriate and queries where it's not. Balanced problem sets prevent one-sided optimization.

What This Changes For Developers

The shift is from reactive debugging to proactive development. Without evals, you wait for user complaints, reproduce manually, fix the bug, hope nothing regressed. With evals, you catch issues before they ship, measure improvements objectively, and upgrade models with confidence.

Real example: Descript built evals around three dimensions of video editing success—don't break things, do what I asked, do it well. They evolved from manual grading to LLM graders with periodic human calibration. Now they run two separate suites: quality benchmarking and regression testing. When a new model drops, they know within hours whether to upgrade.

Bolt.new started building evals after they already had a widely used agent. In three months, they built a system that runs the agent, grades outputs with static analysis, uses browser agents to test apps, and employs LLM judges for instruction following. The result: they can ship changes faster because they know what breaks.

The compounding value is easy to miss. Costs are visible upfront—writing tasks, building graders, maintaining infrastructure. Benefits accumulate later—faster iteration, confident upgrades, fewer production fires. Teams that invest early find development accelerates. Teams that wait find themselves reverse-engineering success criteria from a live system.

Evals also change how you adopt new models. When Opus 4.5 launched, teams with evals quickly determined its strengths, tuned prompts, and upgraded in days. Teams without evals faced weeks of manual testing while competitors shipped. The gap compounds with every model release.

One subtle benefit: evals become the communication layer between product and research. Product teams define what success looks like through test cases. Research teams optimize against those metrics. No ambiguity about whether the agent "feels better"—the eval score either went up or it didn't.

Try It Yourself

Start with 20-50 tasks drawn from real failures. Not synthetic scenarios. Not edge cases you imagine. The bugs you've already fixed and the manual checks you run before each release.

Write unambiguous tasks where two domain experts would independently reach the same pass/fail verdict. If the task asks the agent to write a script but doesn't specify a filepath, and your grader assumes a particular filepath, the agent will fail through no fault of its own. Everything the grader checks should be clear from the task description.
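One mechanical way to enforce that rule is to make the task spec carry every detail the grader checks, then assert the agreement. A hypothetical spec:

```python
# Hypothetical task spec; the point is that every path the grader will
# inspect appears verbatim in the prompt the agent actually sees.
task = {
    "id": "csv-report-001",
    "prompt": (
        "Write a Python script that reads input.csv and writes the row "
        "count to report.txt in the working directory. Save the script "
        "as solve.py."
    ),
    "graded_paths": ["solve.py", "report.txt"],
}

# If this fails, the grader checks something the agent was never told,
# and the agent would fail through no fault of its own.
assert all(path in task["prompt"] for path in task["graded_paths"])
```

A lint like this across your whole task set catches grader/task mismatches before any model run does.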

For each task, create a reference solution—a known working output that passes all graders. This proves the task is solvable and verifies graders are correctly configured. If frontier models score 0% across many trials, that's usually a broken task, not an incapable agent.
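That check is cheap to automate: run the graders against the reference solution before any agent ever sees the task. Names here are illustrative:

```python
def validate_task(task: dict, reference_output, graders) -> None:
    """Fail fast if the known-good reference doesn't pass every grader.

    A reference solution that scores below 100% means the task or a
    grader is broken -- fix that before blaming the agent.
    """
    failed = [g.__name__ for g in graders if not g(task, reference_output)]
    if failed:
        raise AssertionError(
            f"task {task['id']!r}: reference solution rejected by {failed}"
        )
```

Wiring this into CI means a misconfigured grader surfaces as a broken build, not as a mysterious 0% score across many trials.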

Build balanced problem sets. Test both where a behavior should occur and where it shouldn't. If you only test whether the agent searches when it should, you'll end up with an agent that searches for everything. One-sided evals create one-sided optimization.
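Concretely, a balanced set pairs each input with the expected behavior in both directions. The queries below are made-up examples in the spirit of the web-search anecdote:

```python
# Each case pairs a query with whether the agent SHOULD search.
cases = [
    ("What's the latest version of Node.js?", True),   # time-sensitive
    ("Who won yesterday's match?", True),              # fresh information
    ("What is 17 * 23?", False),                       # pure arithmetic
    ("Explain what a binary search tree is.", False),  # stable knowledge
]

def grade_search_decision(agent_searched: bool, should_search: bool) -> bool:
    # Penalizes both failure modes: missing a needed search AND
    # searching when it isn't warranted.
    return agent_searched == should_search

# An always-search agent scores only 50% on this balanced set, so the
# degenerate policy can't win:
always_search_score = sum(grade_search_decision(True, s) for _, s in cases)
assert always_search_score / len(cases) == 0.5
```

The same shape works for any binary behavior: ask a clarifying question vs. proceed, escalate vs. resolve, refuse vs. answer.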

Choose graders thoughtfully. Deterministic where possible, LLM-based where necessary, human for calibration. Don't check that agents followed specific steps—grade what they produced, not the path they took. Agents regularly find valid approaches eval designers didn't anticipate.

Read the transcripts. You won't know if graders work until you read many trials. When a task fails, the transcript tells you whether the agent made a genuine mistake or your graders rejected a valid solution. Failures should seem fair—it's clear what went wrong and why.

Several frameworks can accelerate this. Harbor runs agents in containerized environments with infrastructure for trials at scale. Promptfoo offers lightweight YAML configuration for prompt testing. Braintrust combines offline evaluation with production observability. LangSmith integrates tightly with LangChain. Pick one that fits your workflow, then invest energy in the evals themselves.

The pattern that works: automated evals for fast iteration, production monitoring for ground truth, periodic human review for calibration. No single layer catches every issue. Combined, failures that slip through one layer get caught by another.

The Bottom Line

Use evals if you're shipping agents to users and need to iterate without breaking things. Skip them if you're still in early prototyping and every change has obvious, large effects you can catch manually. The inflection point comes when you can no longer tell if the agent got better or worse without systematic measurement.

The real risk is waiting too long. Early in development, product requirements naturally translate into test cases. Wait until you're in production and you're reverse-engineering success criteria from user complaints. The longer you wait, the harder evals become to build.

The real opportunity is treating evals as a core component from day one. Not an afterthought. Not when things break. Anthropic's multi-agent systems that code for hours work because they have evals that catch failures before they compound. Claude Code's autonomous mode ships confidently because every change runs against regression suites.

Start with what you already test manually. Write unambiguous tasks with reference solutions. Build balanced problem sets. Design graders thoughtfully. Read the transcripts. The fundamentals are constant across agent types. The value compounds, but only if you start early.

Source: Anthropic