Multi-Agent Harnesses: How Anthropic Built Apps That Code for Hours

Anthropic built a three-agent system that codes full-stack apps autonomously for hours. The key: separating generation from evaluation and making them argue. Here's how it works and when it's worth the $200 cost.

TL;DR

  • Anthropic built a three-agent system (planner, generator, evaluator) that autonomously codes full-stack apps over multi-hour sessions
  • Separating generation from evaluation fixes the self-grading problem — agents are terrible at judging their own work
  • The harness ran for 4+ hours and cost $124-200 per app, but produced working DAWs and game engines from one-sentence prompts
  • As models improve, harness complexity should decrease — Opus 4.6 eliminated the need for sprint decomposition that 4.5 required

The Big Picture

Getting Claude to build complete applications without human intervention has been a moving target. Early attempts hit ceilings fast: agents would lose coherence as context windows filled, praise mediocre work when asked to self-evaluate, and produce apps that looked impressive but broke the moment you clicked anything.

Prithvi Rajasekaran, an engineer on Anthropic's Labs team, spent months attacking this problem from two angles: frontend design (subjective, taste-driven) and autonomous coding (verifiable, correctness-driven). The breakthrough came from borrowing a concept from Generative Adversarial Networks: split the agent into a generator and an evaluator, then make them argue.

The result is a harness architecture that produces full-stack applications over 4-6 hour autonomous coding sessions. Feed it a one-sentence prompt like "build a retro game maker" and it expands that into a 16-feature spec, negotiates implementation contracts between agents, and ships a working app with sprite editors, level designers, and AI-assisted tooling baked in.

This isn't about waiting for better models to solve the problem. It's about building scaffolding that pushes current models past what they can do alone — then stripping that scaffolding back down as the models catch up.

How It Works

Why Single-Agent Approaches Fail

Two failure modes kept showing up in earlier work. First, models lose coherence on lengthy tasks as context windows fill. Claude Sonnet 4.5 exhibited "context anxiety" — it would start wrapping up work prematurely as it approached what it believed was its context limit. Context resets (clearing the window entirely and handing off state via structured artifacts) solved this, but added orchestration complexity and latency.
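
The context-reset mechanism can be sketched in a few lines. This is a simplified illustration, not the harness's actual code: the token counts, threshold, and helper names (`should_reset`, `write_handoff_artifact`, `seed_prompt_from_artifact`) are all assumptions; the real orchestration runs through the Claude Agent SDK.

```python
import json

CONTEXT_LIMIT = 200_000   # assumed window size for illustration
RESET_THRESHOLD = 0.8     # reset well before the window actually fills

def should_reset(tokens_used: int) -> bool:
    """Trigger a reset before the model starts wrapping up prematurely."""
    return tokens_used > CONTEXT_LIMIT * RESET_THRESHOLD

def write_handoff_artifact(path: str, state: dict) -> None:
    """Persist structured state so a fresh session can pick up mid-task."""
    with open(path, "w") as f:
        json.dump(state, f, indent=2)

def seed_prompt_from_artifact(path: str) -> str:
    """Build the opening prompt for the post-reset session."""
    with open(path) as f:
        state = json.load(f)
    return (
        f"Resume this build. Completed: {state['completed']}. "
        f"In progress: {state['in_progress']}. Next: {state['next_steps']}."
    )

state = {
    "completed": ["level editor", "sprite editor"],
    "in_progress": "entity behaviors",
    "next_steps": ["playable test mode"],
}
write_handoff_artifact("handoff.json", state)
prompt = seed_prompt_from_artifact("handoff.json")
```

The key design point: the handoff is a structured artifact on disk, not a summary living in the context window, so the new session starts clean.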

Second, self-evaluation is broken. Ask an agent to grade its own work and it confidently praises mediocre outputs. This is especially bad for subjective tasks like design, where there's no binary pass/fail check. But even on verifiable tasks, agents show poor judgment that tanks their performance.

Separating the agent doing the work from the agent judging it turns out to be a strong lever. The evaluator is still an LLM inclined to be generous, but tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.

The Frontend Design Experiment

Rajasekaran started with frontend design, where the self-evaluation problem was most visible. Claude normally gravitates toward safe, predictable layouts — technically functional but visually unremarkable.

He wrote four grading criteria that both generator and evaluator received in their prompts:

  • Design quality: Does this feel like a coherent whole or a collection of parts? Colors, typography, layout, and imagery should combine to create a distinct mood.
  • Originality: Are there custom decisions, or is this template layouts and library defaults? Purple gradients over white cards fail here.
  • Craft: Typography hierarchy, spacing consistency, color harmony, contrast ratios. A competence check.
  • Functionality: Can users understand the interface and complete tasks without guessing?

Design quality and originality were weighted heavily. Claude already scored well on craft and functionality by default — the required technical competence came naturally. But on design and originality, outputs were bland at best.
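
A weighted rubric like this is straightforward to encode. The exact weights aren't published, so the numbers below are assumptions that merely reflect "design quality and originality were weighted heavily":

```python
# Illustrative rubric only -- the article does not give the real weights.
RUBRIC = {
    "design_quality": 0.35,
    "originality":    0.35,
    "craft":          0.15,
    "functionality":  0.15,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10 scale) into one weighted total."""
    return sum(RUBRIC[name] * scores[name] for name in RUBRIC)

# A typical default Claude output per the article: competent but bland.
baseline = {"design_quality": 4, "originality": 3, "craft": 8, "functionality": 8}
```

With these weights, strong craft can't rescue a bland design: the baseline above scores under 5 overall despite two 8s.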

The loop ran on the Claude Agent SDK. A generator agent created an HTML/CSS/JS frontend. The evaluator used the Playwright MCP to interact with the live page directly — navigating, screenshotting, studying the implementation — before scoring each criterion and writing a detailed critique. That feedback flowed back to the generator for the next iteration.

Runs stretched up to four hours, with 5-15 iterations per run. The generator made a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn't working.

In one example, the prompt was to create a website for a Dutch art museum. By iteration nine, it had produced a clean, dark-themed landing page. Then on iteration ten, it scrapped everything and reimagined the site as a 3D room with a checkered floor rendered in CSS perspective, artwork hung on walls in free-form positions, and doorway-based navigation between gallery rooms. The kind of creative leap that doesn't happen in single-pass generation.

Scaling to Full-Stack Coding

The generator-evaluator pattern maps naturally onto software development, where code review and QA serve the same structural role. Rajasekaran built a three-agent system:

Planner: Takes a 1-4 sentence prompt and expands it into a full product spec. Prompted to be ambitious about scope and focus on product context rather than granular technical details. Also finds opportunities to weave AI features into the spec.

Generator: Works in sprints (initially — more on this later), picking up one feature at a time from the spec. Implements with a React, Vite, FastAPI, and SQLite stack. Self-evaluates at the end of each sprint before handing off to QA.

Evaluator: Uses Playwright MCP to click through the running application like a user would, testing UI features, API endpoints, and database states. Grades each sprint against bugs found and criteria covering product depth, functionality, visual design, and code quality. Each criterion has a hard threshold — if any one falls below it, the sprint fails and the generator gets detailed feedback.
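
The hard-threshold gating is worth pinning down, since it differs from a weighted average: one weak criterion fails the whole sprint. A sketch, with hypothetical floor values (the article doesn't publish them):

```python
# Hypothetical per-criterion floors on a 0-10 scale.
THRESHOLDS = {
    "product_depth": 7.0,
    "functionality": 8.0,
    "visual_design": 7.0,
    "code_quality":  7.0,
}

def sprint_passes(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """A sprint fails if ANY criterion falls below its floor, regardless of the rest."""
    failures = [name for name, floor in THRESHOLDS.items()
                if scores.get(name, 0.0) < floor]
    return (not failures), failures
```

The returned failure list doubles as the skeleton of the feedback the generator receives.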

Before each sprint, the generator and evaluator negotiate a sprint contract: agreeing on what "done" looks like before any code is written. This bridges the gap between high-level user stories and testable implementation. The generator proposes what it will build and how success will be verified. The evaluator reviews the proposal to make sure the generator is building the right thing. They iterate until they agree.
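
The negotiation round-trip can be sketched as a propose-review loop. Both functions below stand in for real model sessions, and the stubbed review rule (approve once edge-case handling is covered) is invented for illustration:

```python
def propose_contract(feature: str, evaluator_notes: list[str]) -> dict:
    """Generator side: propose what 'done' means, folding in prior objections."""
    criteria = [f"user can {feature}"] + [f"handles: {n}" for n in evaluator_notes]
    return {"feature": feature, "done_criteria": criteria}

def review_contract(contract: dict) -> list[str]:
    """Evaluator side: return objections, or [] to signal agreement.
    Stub rule: object until the contract covers edge-case handling."""
    if not any(c.startswith("handles:") for c in contract["done_criteria"]):
        return ["empty level", "overlapping entities"]
    return []

def negotiate(feature: str, max_rounds: int = 3) -> dict:
    notes: list[str] = []
    for _ in range(max_rounds):
        contract = propose_contract(feature, notes)
        objections = review_contract(contract)
        if not objections:
            return contract  # both agents have signed off
        notes = objections
    raise RuntimeError("no agreement reached")

contract = negotiate("place tiles in the level editor")
```

The point of the structure: code generation doesn't start until `review_contract` returns empty, so "done" is pinned down before implementation.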

Communication happens via files. One agent writes a file, another reads it and responds either within that file or with a new file the previous agent reads in turn.
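
File-based messaging needs only a shared directory and a naming convention. A minimal sketch (the directory layout and naming scheme here are assumptions, not the harness's actual format):

```python
import json
import pathlib

INBOX = pathlib.Path("agent_messages")
INBOX.mkdir(exist_ok=True)

def send(sender: str, recipient: str, seq: int, body: dict) -> pathlib.Path:
    """One agent writes a message file; the other finds it by name."""
    path = INBOX / f"{seq:04d}_{sender}_to_{recipient}.json"
    path.write_text(json.dumps({"from": sender, "body": body}))
    return path

def receive(recipient: str) -> list[dict]:
    """Read messages addressed to this agent, in sequence order."""
    paths = sorted(INBOX.glob(f"*_to_{recipient}.json"))
    return [json.loads(p.read_text()) for p in paths]

send("generator", "evaluator", 1, {"sprint": 3, "status": "ready for QA"})
send("evaluator", "generator", 2, {"verdict": "fail", "bugs": ["fill tool broken"]})
```

Files beat in-context message passing here because they survive context resets and leave an auditable trail of every exchange.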

What This Changes For Developers

Rajasekaran tested the harness with a prompt to generate a retro video game maker: "Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode."

He ran the same prompt through both the full harness (on Opus 4.5) and a single-agent baseline for comparison. The solo run took 20 minutes and cost $9; the harness run took 6 hours and cost $200.

The solo output looked promising at first. But as he clicked through, issues emerged. The layout wasted space. The workflow was rigid — nothing guided you to create sprites and entities before populating a level. Worst of all, the game didn't work: entities appeared on screen, but nothing responded to input. The wiring between entity definitions and the game runtime was broken.

The harness run started from the same one-sentence prompt, but the planner expanded it into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted: sprite animation system, behavior templates, sound effects and music, AI-assisted sprite generator and level designer, and game export with shareable links.

The app showed immediate polish. The canvas used the full viewport, panels were sized sensibly, and the interface had a consistent visual identity. The sprite editor was noticeably richer. And most importantly, play mode actually worked — you could move your entity and play the game.

The evaluator kept the implementation in line with the spec. Each sprint, it walked through the contract's test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. Sprint 3 alone had 27 criteria covering the level editor.

Examples of issues the evaluator caught:

  • Rectangle fill tool: Tool only placed tiles at drag start/end points instead of filling the region. fillRectangle function existed but wasn't triggered properly on mouseUp.
  • Entity deletion: Delete key handler required both selection and selectedEntityId to be set, but clicking an entity only set selectedEntityId.
  • Frame reordering: PUT /frames/reorder route defined after /{frame_id} routes. FastAPI matched 'reorder' as a frame_id integer and returned 422.

Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, it would identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway. It also tested superficially rather than probing edge cases.

The tuning loop: read the evaluator's logs, find examples where its judgment diverged from expectations, and update the QA prompt to address those failure modes. It took several rounds before the evaluator was grading reasonably.

Simplifying the Harness for Opus 4.6

Every component in a harness encodes an assumption about what the model can't do on its own. Those assumptions are worth stress testing — they may be incorrect, and they can quickly go stale as models improve.

When Opus 4.6 landed, Rajasekaran stripped the harness back. The model planned more carefully, sustained agentic tasks for longer, operated more reliably in larger codebases, and had better code review and debugging skills. It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.

He removed the sprint construct entirely. The sprint structure had helped decompose work into chunks the model could handle coherently; Opus 4.6 could manage the job without that decomposition.

He kept both the planner and evaluator. Without the planner, the generator under-scoped — it would start building without first speccing its work and end up creating a less feature-rich application. The evaluator moved to a single pass at the end of the run rather than grading per sprint.

The evaluator's usefulness now depended on where the task sat relative to what the model could do reliably on its own. On 4.5, builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues. On 4.6, the model's raw capability increased. Tasks that used to need the evaluator's check were now often within what the generator handled well on its own. But for parts of the build still at the edge of the generator's capabilities, the evaluator continued to give real lift.

The practical implication: including the evaluator is not a fixed yes-or-no decision. It's worth the cost when the task sits beyond what the current model does reliably solo.

Try It Yourself

The harness architecture described here isn't packaged as a ready-to-use tool yet. But the underlying patterns are implementable with the Claude Agent SDK and Playwright MCP.

Key implementation details:

  • Use the Claude Agent SDK for orchestration and automatic context compaction
  • Give the evaluator agent access to Playwright MCP so it can interact with running applications
  • Define grading criteria upfront and calibrate the evaluator with few-shot examples
  • Use file-based communication between agents (one agent writes, another reads and responds)
  • For multi-sprint work, implement contract negotiation before each sprint

If you're experimenting with autonomous coding workflows, the generator-evaluator split is the highest-leverage pattern to test first. Start simple: one generator, one evaluator, clear grading criteria. Add complexity only when you hit a ceiling.

The Bottom Line

Use this approach if you're building agents that need to work autonomously for hours, not minutes — and where output quality matters more than speed or cost. The harness runs are expensive ($124-200 per app) and slow (4-6 hours), but they produce working applications from one-sentence prompts. Skip it if you're prototyping quickly or working on tasks the base model already handles well solo.

The real insight here isn't the specific three-agent architecture. It's the principle: separate generation from evaluation, define concrete grading criteria, and tune the evaluator to be skeptical. As models improve, the scaffolding you need will change — Opus 4.6 eliminated the sprint decomposition that 4.5 required. The interesting work for AI engineers is finding the next novel combination that pushes models past what they can do alone, then stripping it back down as the models catch up.

The space of interesting harness combinations doesn't shrink as models improve. It moves. And right now, it's moving toward multi-agent systems that argue with themselves until the output is actually good.

Source: Anthropic