Anthropic Built a C Compiler with 16 Parallel Claudes

An Anthropic researcher ran 16 parallel Claude instances for two weeks to build a 100,000-line C compiler from scratch. It compiles Linux, cost $20k in API calls, and reveals where autonomous agent teams hit their limits.

TL;DR

  • An Anthropic researcher ran 16 Claude instances in parallel for 2 weeks to build a C compiler from scratch
  • The compiler is 100,000 lines of Rust, compiles Linux 6.9 on three architectures, and cost $20k in API calls
  • Agent teams work autonomously without human intervention by using git locks, specialized roles, and continuous testing
  • This approach hit real limits: agents broke existing code when adding features, and some tasks exceeded Opus 4.6's capabilities

The Big Picture

Nicholas Carlini, a researcher on Anthropic's Safeguards team, just published what might be the most ambitious autonomous agent experiment yet. He tasked 16 parallel Claude instances with building a production-grade C compiler capable of compiling the Linux kernel. No human intervention during development. Just Claude, git, and a test harness.

The result: a 100,000-line Rust compiler that boots Linux 6.9 on x86, ARM, and RISC-V. It compiles QEMU, FFmpeg, SQLite, and passes 99% of the GCC torture test suite. Total cost: $20,000 in API calls over 2,000 Claude Code sessions.

This isn't about the compiler itself—though it's an impressive artifact. It's about what Carlini learned designing harnesses for long-running autonomous agent teams. How do you keep 16 Claudes working in parallel without stepping on each other? How do you write tests that guide agents without human oversight? Where does this approach break down?

The answers matter because this is where AI coding tools are headed. Not pair programming sessions where you babysit Claude. Autonomous teams that work for hours or days while you sleep.

How It Works

The core harness is deceptively simple. Claude runs in an infinite loop inside a Docker container. When it finishes one task, it immediately picks up the next. No waiting for human input. No status updates. Just continuous progress until the problem is solved.

The bash script is bare-bones: spawn Claude Code with a prompt, log the output, repeat forever. The prompt tells Claude to break problems into small pieces, track what it's working on, figure out what to do next, and keep going until it's perfect. Claude has no choice—the loop runs forever.
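The real harness is a bash script; the same loop can be sketched in a few lines of Python (the `claude` CLI invocation and prompt wording below are illustrative assumptions, not the actual harness):

```python
import datetime
import subprocess

PROMPT = ("Break the problem into small pieces, track what you are working on, "
          "figure out what to do next, and keep going until the task is done.")

def run_sessions(cmd, log_path, max_sessions=None):
    """Run agent sessions back-to-back, appending all output to one log.
    max_sessions=None reproduces the harness's infinite loop: when one
    session exits, the next starts immediately, with no human input."""
    n = 0
    while max_sessions is None or n < max_sessions:
        with open(log_path, "a") as log:
            log.write(f"--- session {n} at {datetime.datetime.now()} ---\n")
            log.flush()  # keep the header ahead of the subprocess output
            subprocess.run(cmd, stdout=log, stderr=log)
        n += 1

# In the real setup: run_sessions(["claude", "-p", PROMPT], "agent.log")
```

The `max_sessions` parameter exists only so the loop can be exercised in a test; the production loop never terminates.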

Parallelization adds the real complexity. Carlini spins up 16 Docker containers, each with a clone of a shared git repo. Agents synchronize using a simple locking mechanism: before working on a task, Claude writes a lock file to current_tasks/. If two agents try to claim the same task, git's synchronization forces the second one to pick something else.

When an agent finishes, it pulls from upstream, merges changes from other agents, pushes its work, and removes the lock. Merge conflicts happen constantly. Claude figures them out. Then the container dies, a fresh one spawns, and the cycle repeats.
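The claim-and-push dance might look roughly like this sketch (file names and the exact git commands are assumptions; the key idea is that a rejected push means another agent won the race):

```python
import os
import subprocess

def try_claim(task, lock_dir="current_tasks"):
    """Attempt to claim a task by committing a lock file and pushing.
    If another agent pushed the same lock first, our push is rejected
    as non-fast-forward; we undo the local claim and pick another task."""
    os.makedirs(lock_dir, exist_ok=True)
    lock = os.path.join(lock_dir, f"{task}.lock")
    if os.path.exists(lock):                   # already claimed upstream
        return False
    open(lock, "w").close()
    subprocess.run(["git", "add", lock], check=True)
    subprocess.run(["git", "commit", "-m", f"claim {task}"], check=True)
    pushed = subprocess.run(["git", "push", "-u", "origin", "HEAD"]).returncode == 0
    if not pushed:                             # lost the race: roll back
        subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)
        subprocess.run(["git", "pull", "--rebase"], check=True)
    return pushed
```

Because the shared remote serializes pushes, no central coordinator is needed: git itself is the mutex.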

There's no orchestration agent. No communication protocol between instances. Each Claude decides independently what to work on next. In practice, Claude picks the "next most obvious" problem. When stuck, it maintains running docs of failed approaches and remaining tasks. You can watch this play out in the git history—agents taking locks, solving problems, moving on.

This approach builds on Anthropic's earlier work with multi-agent harnesses for long-running applications, but pushes the concept further by removing human oversight entirely.

Specialization emerged naturally. While most agents worked on compiler features, Carlini assigned specific roles: one agent coalesced duplicate code, another optimized compiler performance, a third focused on output efficiency. One agent critiqued the project from a Rust developer's perspective and made structural improvements. Another maintained documentation.

What This Changes For Developers

The real insight isn't that Claude can write a compiler. It's that the bottleneck shifted entirely to test design.

Carlini spent most of his time building the environment around Claude—the tests, the feedback loops, the verification harness. Get that right, and Claude works autonomously for days. Get it wrong, and agents solve the wrong problem or spin in circles.

High-quality tests became critical. Claude will work autonomously to solve whatever problem you give it, so the task verifier must be nearly perfect. Carlini found high-quality compiler test suites, wrote verifiers for open-source packages, watched for mistakes Claude made, then designed new tests to catch those failure modes.

Near the end, Claude started breaking existing functionality every time it added a feature. The fix: a continuous integration pipeline with stricter enforcement. New commits couldn't land if they broke existing code.
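A regression gate of this kind can be sketched as a simple baseline check (the file name and `run_tests` interface are hypothetical): a commit is rejected if any test that passed before now fails, while pre-existing failures are tolerated.

```python
import json

def gate_commit(run_tests, baseline_path="passing_tests.json"):
    """Reject a commit that breaks previously passing tests.
    run_tests() returns {test_name: passed_bool}. Tests that were
    already failing don't block the commit; newly passing tests are
    added to the baseline so they're protected from then on."""
    results = run_tests()
    try:
        with open(baseline_path) as f:
            baseline = set(json.load(f))
    except FileNotFoundError:
        baseline = set()
    regressions = sorted(t for t in baseline if not results.get(t, False))
    if regressions:
        return False, regressions
    with open(baseline_path, "w") as f:
        json.dump(sorted(t for t, ok in results.items() if ok), f)
    return True, []
```

Ratcheting the baseline forward like this means the agents can only ever expand the set of passing tests, never shrink it.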

Context window management mattered more than expected. Test harnesses shouldn't print thousands of useless bytes. Print a few lines of output, log everything else to files Claude can grep when needed. Logfiles should be machine-readable: if there's an error, write ERROR on the same line so grep finds it. Pre-compute aggregate statistics so Claude doesn't waste time recalculating them.
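Those conventions are easy to encode in the harness's logging. A minimal sketch (the format itself is illustrative):

```python
def log_result(log, test_name, passed, detail=""):
    """One machine-readable line per test, with the word ERROR on the
    same line as the failing test so `grep ERROR logfile` finds every
    failure, and a short detail instead of a full output dump."""
    status = "PASS" if passed else f"ERROR {detail}"
    log.write(f"{test_name}: {status}\n")

def summarize(results, log):
    """Pre-compute the aggregate statistics so the agent never burns
    tokens recounting them from raw output."""
    total = len(results)
    failed = sum(1 for ok in results.values() if not ok)
    log.write(f"SUMMARY: {total - failed}/{total} passed, {failed} failed\n")
```

The agent sees only the summary line in its context; the per-test lines live in a file it can grep on demand.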

Time blindness is real. Claude can't tell time and will happily spend hours running tests instead of making progress. Carlini added a --fast flag that runs a 1% or 10% random sample. The subsample is deterministic per-agent but random across VMs, so Claude still covers all files but each agent can identify regressions quickly.
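One way to get a subsample that is stable for a given agent but varies across agents is to hash the (agent id, test name) pair, a sketch under that assumption:

```python
import hashlib

def fast_subset(tests, agent_id, fraction=0.10):
    """Select a deterministic ~fraction sample of the suite for this
    agent. Hashing (agent_id, test) means the same agent always reruns
    the same tests, so regressions stay visible run to run, while
    different agents sample different slices of the suite."""
    threshold = int(fraction * 2**32)
    def bucket(test):
        digest = hashlib.sha256(f"{agent_id}:{test}".encode()).digest()
        return int.from_bytes(digest[:4], "big")
    return [t for t in tests if bucket(t) < threshold]
```

With 16 agents each sampling 10%, the fleet collectively covers most of the suite on every cycle even though no single agent runs it all.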

Parallelism required creative problem decomposition. When there are hundreds of failing tests, parallelization is trivial—each agent picks a different test. But when agents started compiling the Linux kernel, they got stuck. Every agent hit the same bug, fixed it, then overwrote each other's changes.

The solution: use GCC as an oracle. Randomly compile most of the kernel with GCC, only the remaining files with Claude's compiler. If the kernel works, the problem isn't in Claude's subset. If it breaks, refine further. This let each agent work in parallel, fixing different bugs in different files, until Claude's compiler could handle everything.
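The file split behind that oracle trick can be sketched like this (the fraction and seeding scheme are illustrative):

```python
import random

def partition_for_agent(kernel_files, agent_id, claude_fraction=0.05):
    """Assign each agent a small random subset of files to build with
    Claude's compiler; everything else is built with GCC. If the kernel
    boots, the bug is not in this agent's subset; if it breaks, the
    agent shrinks its subset and repeats, bisecting toward the bug."""
    rng = random.Random(agent_id)          # per-agent deterministic split
    files = sorted(kernel_files)
    k = max(1, int(len(files) * claude_fraction))
    claude_files = set(rng.sample(files, k))
    gcc_files = [f for f in files if f not in claude_files]
    return sorted(claude_files), gcc_files
```

Because each agent's subset is disjoint in practice, agents stop rediscovering and overwriting the same fix and instead converge on different bugs in parallel.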

Try It Yourself

The compiler source is available on GitHub. It's a clean-room implementation—Claude never had internet access during development. Zero dependencies except the Rust standard library.

Fair warning: this is a research prototype, not a production tool. It lacks a 16-bit x86 code generator for booting Linux out of real mode (it cheats and calls GCC for that phase). Its own assembler and linker exist but are still buggy. The generated code is inefficient; even with all optimizations enabled, it's slower than GCC with optimizations disabled.

But it works. It compiles and runs Doom. It builds Redis, PostgreSQL, libjpeg, Lua. It passes 99% of the GCC torture test suite. For a fully autonomous agent team working without human oversight, that's remarkable.

The project consumed 2 billion input tokens and generated 140 million output tokens. At $20,000 total, that's expensive compared to Claude Max plans. But it's a fraction of what it would cost to hire a team to build this from scratch.

The Bottom Line

Use this approach if you're building complex systems where the test harness can be more precise than human judgment, and where parallelizable work exists at scale. Skip it if your project requires nuanced design decisions that tests can't capture, or if you're working in domains where autonomous agents might introduce security risks you can't easily verify.

The real risk here isn't technical—it's cultural. Carlini used to work in penetration testing, exploiting vulnerabilities in products from large companies. The thought of programmers deploying software they've never personally verified concerns him. When a human sits with Claude during development, they catch errors in real time. With autonomous systems, passing tests don't guarantee the job is done.

This experiment hit the limits of Opus 4.6's capabilities. New features frequently broke existing functionality. Some tasks—like implementing a 16-bit x86 code generator under 32kb—exceeded what the model could do. The Rust code quality is reasonable but nowhere near what an expert would produce.

Yet Carlini says he wouldn't have expected anything like this to be possible in early 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. That's exciting and unsettling in equal measure. We're entering a world where the bottleneck isn't writing code—it's designing the systems that verify it.

Source: Anthropic