How Anthropic Solved the Long-Running Agent Problem

Anthropic cracked the long-running agent problem with a two-part harness: an initializer that scaffolds 200+ features, and a coding agent that works incrementally with git-based progress tracking. Here's how it works.

TL;DR

  • Long-running agents fail because each new context window starts with zero memory of previous work
  • Anthropic's solution: an initializer agent that scaffolds the environment, plus a coding agent that works incrementally and leaves clean artifacts
  • Key innovations include a structured feature list (200+ items), git-based progress tracking, and mandatory end-to-end testing with browser automation
  • This matters if you're building agents that need to work across hours or days, not just single sessions

The Big Picture

AI agents hit a wall when tasks stretch beyond a single context window. The problem isn't capability — it's memory. Every new session starts from scratch, like hiring engineers who work in shifts but never talk to each other.

Anthropic's engineering team ran into this building the Claude Agent SDK. Even Opus 4.5 would fail to build a production web app when given a high-level prompt like "build a clone of claude.ai." The agent would either try to one-shot the entire app and run out of context mid-implementation, or look around after a few features shipped and declare victory prematurely.

Context compaction wasn't enough. The real issue was structural: agents needed a way to understand what happened before, make incremental progress, and leave clear breadcrumbs for the next session. Anthropic's solution splits the work between two specialized prompts — an initializer that scaffolds the environment on first run, and a coding agent that ships features incrementally while maintaining clean state.

This isn't theoretical. The approach enabled Claude to build a functional claude.ai clone across multiple context windows, working feature-by-feature with proper testing and git hygiene. The techniques generalize beyond web apps to any long-horizon agentic work.

How It Works

The architecture uses two distinct agent configurations. Both run on the same harness with identical tools and system prompts, but they get different initial user prompts depending on whether it's the first session or a continuation.
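The dispatch between the two configurations can be sketched in a few lines: the harness only needs to check whether the first-session artifacts already exist. This is a minimal illustration, not Anthropic's actual quickstart code; the prompt text and the feature_list.json file name are assumptions.

```python
import os

# Hypothetical initial user prompts; the real harness sends different
# prompts depending on whether setup has already happened.
INITIALIZER_PROMPT = "Set up init.sh, a git repo, claude-progress.txt, and a feature list."
CODING_PROMPT = "Read the progress artifacts, pick one incomplete feature, implement and test it."

def choose_prompt(project_dir: str) -> str:
    """First session gets the initializer prompt; every later one gets the coding prompt."""
    marker = os.path.join(project_dir, "feature_list.json")
    return CODING_PROMPT if os.path.exists(marker) else INITIALIZER_PROMPT
```

The marker file doubles as the handoff artifact: once the initializer has written it, every future session is a continuation by definition.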

The initializer agent runs once at project start. Its job is environment setup: create an init.sh script to run the dev server, establish a git repo with an initial commit, write a claude-progress.txt file for session logs, and — critically — generate a comprehensive feature list.

That feature list turned out to be essential. For the claude.ai clone, the initializer wrote over 200 discrete features in structured JSON format. Each entry includes a description, test steps, and a passes boolean initially set to false. Example: "New chat button creates a fresh conversation" with verification steps like "Click the 'New Chat' button" and "Verify conversation appears in sidebar."

The JSON format matters. Anthropic found that Claude is less likely to inappropriately modify or delete JSON entries than Markdown ones. The coding agent is explicitly forbidden from editing anything except the passes field, and the prompt includes strongly worded instructions: "It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality."
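A harness can also enforce that rule outside the prompt. Here's a minimal sketch of such a guard, assuming the feature list is a JSON array of objects (this validator is an illustration, not part of Anthropic's published code):

```python
def validate_feature_update(old_features: list, new_features: list) -> bool:
    """Allow only 'passes' flags to change between old and new feature lists.

    Adding, removing, or rewording entries fails validation, mirroring the
    rule that the coding agent may not edit tests.
    """
    if len(old_features) != len(new_features):
        return False
    for old, new in zip(old_features, new_features):
        if old.keys() != new.keys():
            return False
        for key in old:
            if key != "passes" and old[key] != new[key]:
                return False
    return True
```

Running a check like this before accepting a session's writes turns a prompt-level convention into a hard invariant.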

The coding agent handles every subsequent session. It follows a strict startup routine: run pwd to confirm the working directory, read git logs and progress files to understand recent work, review the feature list, and choose the highest-priority incomplete feature.

Before touching any code, the agent runs init.sh to start the dev server and executes a basic end-to-end test. For the web app, that meant using the Puppeteer MCP server to start a chat, send a message, and verify a response. If the app is broken, the agent fixes it before implementing anything new. This prevents cascading failures where new features compound existing bugs.
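The pre-flight idea can be approximated even without browser automation. A minimal sketch, assuming the dev server exposes a plain HTTP endpoint (this is a stand-in for the Puppeteer check, not Anthropic's harness):

```python
import urllib.request

def smoke_test(url: str, timeout: float = 5.0) -> bool:
    """Return True if the dev server answers with HTTP 200.

    A floor check before any new feature work; a status-code probe only
    proves the server is up, so it does not replace end-to-end testing.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```

In the real harness this step drives a browser; treat a probe like this as the minimum bar, with browser-level verification layered on top.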

The incremental approach is non-negotiable. The agent works on exactly one feature per session. When done, it commits to git with a descriptive message, updates claude-progress.txt with a summary, and marks the feature as passing in the JSON file only after thorough testing.
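The end-of-session bookkeeping is small once split into its two halves: flipping the flag, then recording the work. A sketch under the same assumed file names (not Anthropic's published code):

```python
import json
import subprocess
from pathlib import Path

def mark_feature_passing(feature_path: str, description: str) -> None:
    """Flip one feature's 'passes' flag to true, only after end-to-end testing."""
    path = Path(feature_path)
    features = json.loads(path.read_text())
    for feature in features:
        if feature["description"] == description:
            feature["passes"] = True
    path.write_text(json.dumps(features, indent=2))

def commit_session(repo_dir: str, description: str, summary: str) -> None:
    """Append a session summary to the progress file and commit everything."""
    with open(Path(repo_dir) / "claude-progress.txt", "a") as log:
        log.write(summary + "\n")
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", f"Implement: {description}"], check=True)
```

Keeping the flag flip separate from the commit makes the ordering explicit: testing gates the flag, and the flag change itself lands in the commit.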

Git integration proved crucial for recovery. Agents use git log to understand what changed recently and can revert bad commits if they break something. This eliminated a common failure mode where agents would waste time trying to reconstruct what happened or fix mysterious bugs introduced by previous sessions.

What This Changes For Developers

This architecture solves a problem that's been blocking production use of long-running agents: the inability to maintain coherent progress across context boundaries. If you're building agents that need to work for hours or days, you now have a proven pattern.

The feature list approach is immediately applicable. Instead of giving an agent a vague high-level goal, you can prompt it to decompose that goal into 100+ discrete, testable features upfront. This prevents premature victory declarations and gives you a clear progress metric.

The testing discipline matters more than you'd expect. Anthropic found that Claude would often make code changes, run unit tests or curl commands, but fail to verify end-to-end functionality. Explicit prompting to use browser automation and test "as a human user would" dramatically improved quality. The agent caught bugs that weren't obvious from code inspection alone.

There are still gaps. Claude's vision limitations mean it can't see browser-native alert modals through Puppeteer, so features relying on those modals tended to be buggier. But the overall pattern — initialize once, work incrementally, test thoroughly, leave clean artifacts — generalizes beyond web development.

The git-based progress tracking is elegant because it leverages tools developers already understand. You can inspect what the agent did by reading commit history. You can revert bad changes. You can diff between sessions. This makes agent behavior debuggable in a way that opaque internal state never could be.

Anthropic's work here connects to their broader research on parallel agent architectures and agent evaluation methodologies. The common thread is treating agents as software systems that need proper engineering discipline, not magic black boxes.

Try It Yourself

Anthropic published a quickstart with code examples demonstrating the initializer and coding agent setup. The core pattern looks like this:

{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": [
    "Navigate to main interface",
    "Click the 'New Chat' button",
    "Verify a new conversation is created",
    "Check that chat area shows welcome state",
    "Verify conversation appears in sidebar"
  ],
  "passes": false
}

Each feature in your JSON list should include concrete verification steps. The coding agent reads this file at session start, picks an incomplete feature, implements it, tests it end-to-end, and only then flips passes to true.
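Selecting that next feature is a short loop over the same file. The article says the agent chooses the highest-priority incomplete feature; this sketch simply takes the first one, assuming the list is already ordered by priority (file name assumed):

```python
import json
from pathlib import Path

def next_incomplete_feature(feature_path: str):
    """Return the first feature whose 'passes' flag is still false,
    or None once everything has shipped."""
    for feature in json.loads(Path(feature_path).read_text()):
        if not feature.get("passes", False):
            return feature
    return None
```

A None return is also a clean termination signal: the session can report completion instead of declaring victory prematurely.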

For the startup routine, prompt your coding agent to follow this sequence every session:

# Confirm working directory
pwd

# Get up to speed on recent work
git log --oneline -20
cat claude-progress.txt

# Start the dev environment
bash init.sh

# Run basic smoke test before new work
# (specific commands depend on your project)

The init.sh script should handle whatever's needed to get your project running — starting servers, installing dependencies, setting environment variables. Write it once in the initializer session, then every coding session can just run it.

The Bottom Line

Use this pattern if you're building agents that need to work across multiple context windows on complex tasks like software development, research projects, or financial modeling. The initializer-plus-incremental-coding architecture is the first production-ready solution to the long-running agent problem.

Skip it if your agent tasks fit comfortably in a single context window or don't require maintaining state across sessions. The overhead of feature lists, git commits, and progress files isn't worth it for simple one-shot tasks.

The real opportunity here is that long-horizon agentic work just became viable. Tasks that previously required constant human intervention to maintain context can now run autonomously for hours. The risk is assuming this solves everything — Anthropic explicitly notes that multi-agent architectures with specialized testing and QA agents might outperform this single-agent approach. This is a foundation, not a ceiling.

Source: Anthropic