16 Parallel Claudes Built a C Compiler That Boots Linux

An Anthropic researcher ran 16 parallel Claude instances for two weeks to build a 100,000-line C compiler that boots Linux. No human intervention. $20k in API costs. This is what autonomous agents can barely do today — and what they'll do routinely tomorrow.


TL;DR

  • An Anthropic researcher ran 16 Claude instances in parallel for two weeks to build a C compiler from scratch
  • The compiler is 100,000 lines of Rust, compiles Linux 6.9 on three architectures, and cost $20k in API calls
  • Agents worked autonomously, coordinating through git-based file locks; no orchestration layer required
  • This is a capability benchmark showing what's barely possible today — and what will be routine tomorrow

The Big Picture

Nicholas Carlini, a researcher on Anthropic's Safeguards team, just published the most ambitious autonomous agent experiment I've seen this year. He tasked 16 parallel Claude instances with building a production-grade C compiler capable of compiling the Linux kernel. No human intervention during development. No internet access. Just Claude, a test harness, and nearly 2,000 autonomous coding sessions.

The result is a 100,000-line Rust compiler that boots Linux 6.9 on x86, ARM, and RISC-V. It compiles QEMU, FFmpeg, SQLite, and passes 99% of the GCC torture test suite. It can compile and run Doom. The entire project consumed 2 billion input tokens and 140 million output tokens over two weeks, costing just under $20,000.

This isn't about the compiler itself — though it's an impressive artifact. It's about what the experiment reveals: we've crossed a threshold where LLMs can autonomously execute complex, multi-week engineering projects. The scaffolding Carlini built to enable this is deceptively simple, and the lessons he learned about keeping agents on track matter far more than the code they produced.

If you've been following Anthropic's work on autonomous coding modes, this is the logical extreme of that direction.

How It Works

The core harness is brutally simple. Each Claude instance runs in a Docker container with a mounted git repo. When one task finishes, it immediately picks up the next. The loop runs forever.

#!/bin/bash
# Infinite loop: as soon as one Claude session exits, start the next.
while true; do
    COMMIT=$(git rev-parse --short=6 HEAD)   # current commit names the log file
    LOGFILE="agent_logs/agent_${COMMIT}.log"
    claude --dangerously-skip-permissions \
           -p "$(cat AGENT_PROMPT.md)" \
           --model claude-opus-X-Y &> "$LOGFILE"
done

The agent prompt tells Claude to break problems into small pieces, track progress, figure out what to work on next, and keep going until it's perfect. There's no orchestration agent. No complex communication protocol. Just git for synchronization.
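The article doesn't reproduce the prompt itself, but based on that description, a prompt file in this style might look something like the following (the wording, file names, and numbering here are entirely hypothetical):

```text
You are one of several agents working on a C compiler written in Rust.
1. Pull the latest changes and read the progress notes to see what remains.
2. Pick the next most obvious unclaimed task and claim it with a lock
   file in current_tasks/.
3. Break the task into small pieces and track your progress as you go.
4. Run the test suite (use --fast for quick iteration). Never commit
   anything that breaks existing tests.
5. Merge, push, release your lock, and keep going until it's perfect.
```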

Coordination happens through file-based locks. When Claude wants to work on a task, it writes a lock file to current_tasks/parse_if_statement.txt. If another agent tries to claim the same task, git's merge semantics force it to pick something else. When the task is done, Claude pulls from upstream, merges changes from other agents, pushes its work, and removes the lock. Merge conflicts are frequent, but Claude handles them.
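The claim/release cycle can be sketched as a toy script. Everything here is illustrative: the `claim` and `release` helpers are made up, and for simplicity the race is resolved by a local existence check in a throwaway repo, whereas the real agents share a remote and rely on git's push/merge semantics to reject the second claimant.

```shell
#!/bin/bash
# Self-contained demo of lock-file task claiming in a throwaway git repo.
set -e
REPO=$(mktemp -d)
cd "$REPO"
git init -q
git config user.email agent@example.com
git config user.name agent
mkdir current_tasks

claim() {
    local lock="current_tasks/$1.txt"
    if [ -e "$lock" ]; then
        echo "BUSY $1"             # another agent holds it; pick a different task
        return 1
    fi
    echo "claimed by $$" > "$lock"
    git add "$lock"
    git commit -qm "claim: $1"
    echo "CLAIMED $1"
}

release() {
    git rm -q "current_tasks/$1.txt"
    git commit -qm "release: $1"
    echo "RELEASED $1"
}

claim parse_if_statement           # first agent wins the task
claim parse_if_statement || true   # a second attempt sees the lock and backs off
release parse_if_statement         # done: remove the lock so others can proceed
```

With a shared remote, the second claim would instead fail at push time, forcing that agent to pull, notice the lock, and move on to something else.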

This is a research prototype, not production infrastructure. There's no fancy orchestration. No agent-to-agent messaging beyond what git provides. Each Claude instance decides independently what to work on next, usually picking the "next most obvious" problem. When stuck, agents maintain running docs of failed approaches and remaining tasks.

The real engineering work went into the test harness. Carlini spent most of his time designing the environment around Claude — the tests, the feedback loops, the progress tracking — so agents could orient themselves without human help. The harness had to be nearly perfect, because Claude will autonomously solve whatever problem you give it. If your tests are wrong, Claude solves the wrong problem.

He built a continuous-integration pipeline that prevented new commits from breaking existing functionality. He designed tests to avoid context-window pollution: instead of dumping thousands of useless bytes, they emit a few lines of output plus structured logfiles that Claude can grep. He also added a --fast flag that runs a 1% or 10% random sample of tests, deterministic per agent but random across VMs, so each agent covers different ground without wasting hours on full test runs.
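One way to get that "deterministic per agent, random across agents" sampling, in the spirit of the --fast flag, is to seed the shuffle from the agent's id. The agent id, the test count, and the output file below are all invented for illustration; this is one possible mechanism, not necessarily the one the harness used.

```shell
#!/bin/bash
# Sketch: each agent deterministically selects its own ~10% slice of the tests.
set -e
AGENT_ID=${AGENT_ID:-7}            # hypothetical per-agent seed

# shuf's randomness comes entirely from --random-source, so feeding it a
# constant stream derived from AGENT_ID makes the permutation reproducible:
# the same agent always reruns the same subset, different agents get others.
seq 1 1000 \
  | shuf --random-source=<(yes "$AGENT_ID") \
  | head -n 100 \
  | sort -n > sample.txt

wc -l < sample.txt                 # 100 test ids selected
```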

The hardest parallelism challenge came when agents started compiling the Linux kernel. Unlike a test suite with hundreds of independent tests, the kernel is one giant task. Every agent hit the same bug, fixed it, and overwrote each other's changes. Having 16 agents didn't help because they were all stuck on the same bottleneck.

The fix was clever: use GCC as an online oracle. The harness compiled most of the kernel with GCC and only a random subset of files with Claude's compiler. If the kernel worked, the problem wasn't in Claude's subset. If it broke, the harness narrowed the suspect subset by recompiling more of those files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude's compiler could handle everything.
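The file-splitting step of that oracle trick can be sketched as follows. Here "ccc" is a made-up name for the agents' compiler, the file list is invented, and the plan only prints the commands it would run; the real harness of course invoked the compilers and tested the resulting kernel.

```shell
#!/bin/bash
# Sketch: route most files to the trusted compiler (gcc), a reproducible
# random slice to the suspect one ("ccc", hypothetical). If the build breaks,
# the bug must be in ccc's slice, so the harness shrinks that slice and retries.
compile_plan() {
    local seed=$1
    RANDOM=$seed                           # seed bash's RNG: reproducible split
    for f in init.c sched.c mm.c fs.c net.c irq.c time.c pid.c; do
        if [ $((RANDOM % 4)) -eq 0 ]; then
            echo "ccc -c $f"               # suspect compiler, ~25% of files
        else
            echo "gcc -c $f"               # trusted oracle, the rest
        fi
    done
}

compile_plan 42
```

Because the split is seeded, a failing run can be reproduced exactly, and two agents given different seeds end up debugging different slices of the kernel.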

Carlini also experimented with specialized agent roles. One agent coalesced duplicate code. Another optimized the compiler's own performance. A third focused on generating efficient compiled output. One agent critiqued the design from a Rust developer's perspective and made structural improvements. Another maintained documentation.

What This Changes For Developers

This experiment is a capability benchmark. Carlini designed it to stress-test the limits of what LLMs can barely achieve today, to help us prepare for what they'll reliably achieve tomorrow. He's been using the C compiler project as a benchmark across the entire Claude 4 model series.

Previous Opus 4 models could barely produce a functional compiler. Opus 4.5 was the first to cross the threshold of passing large test suites, but couldn't compile real projects. Opus 4.6 is the first model that can autonomously execute a multi-week engineering project of this complexity.

The cost structure is revealing. $20,000 in API calls sounds expensive — it's more than even the most expensive Claude Max plan. But it's a fraction of what it would cost to hire a team to build this from scratch. And the cost is dropping fast: what costs $20k today will cost $2k next year.

The workflow implications are significant. Early models were useful for tab-completion. Then they could complete a function body from a docstring. Claude Code brought agents into the mainstream for pair programming. But all of these assume a user defines a task, the LLM runs for a few minutes, and the user provides follow-up.

Agent teams break that assumption. You can now define a complex, multi-week project and let Claude execute it autonomously. This changes the scope of what's achievable. You become more ambitious with your goals because the bottleneck isn't your time — it's the quality of your test harness and the clarity of your requirements.

But there are real risks. When a human sits with Claude during development, they catch errors in real time and ensure consistent quality. With autonomous systems, it's easy to see tests pass and assume the job is done, when that's rarely the case. Carlini used to work in penetration testing, exploiting vulnerabilities in products from large companies, and from that vantage point the thought of programmers deploying software they've never personally verified is a real concern.

Try It Yourself

The compiler source code is public. Download it, read through the code, try it on your favorite C projects. Carlini says the best way to understand what language models can do is to push them to their limits and study where they break down.

The compiler has limitations. It lacks the 16-bit x86 code generator needed to boot Linux out of real mode, so it calls out to GCC for that phase. Its own assembler and linker are still buggy. It successfully builds many projects, but not all. The generated code is inefficient: even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled. The Rust code quality is reasonable but nowhere near what an expert would produce.

These limitations aren't bugs; they're the ceiling. Carlini tried hard to fix them and couldn't. New features and bugfixes frequently broke existing functionality. The compiler has nearly reached the limits of Opus 4.6's abilities. As one example, Opus was unable to implement a 16-bit x86 code generator that stayed under Linux's 32k code limit, which is why the compiler cheats and calls out to GCC for that phase.

If you want to experiment with multi-agent harnesses yourself, the key lessons are: write extremely high-quality tests, design for Claude's limitations (context window pollution, time blindness), make parallelism easy by breaking work into independent tasks, and use specialized agent roles where it makes sense.

The Bottom Line

Use this approach if you have a well-defined, testable engineering problem that's too large for a single developer but doesn't require deep architectural judgment. Skip it if your problem space is ambiguous, your tests are weak, or you can't afford to verify the output thoroughly. The real risk here isn't that autonomous agents will write bad code — it's that we'll trust passing tests as proof of correctness when they're not.

Carlini didn't expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. He expects the positive applications to outweigh the negative, but we're entering a new world that requires new strategies to navigate safely.

The compiler itself is impressive, but the real artifact is the harness design and the lessons learned. If you're building autonomous agent systems, study this experiment closely. The gap between "Claude can do this with supervision" and "Claude can do this autonomously for two weeks" is entirely in the quality of your test infrastructure and the clarity of your feedback loops.

Source: Anthropic