Infrastructure Noise in AI Coding Evals: The 6-Point Leaderboard Gap

Anthropic found that infrastructure configuration alone creates a 6-point spread on Terminal-Bench 2.0 scores, larger than most leaderboard gaps. Resource limits below 3x the benchmark specs cause spurious container kills; above 3x, the extra resources let agents solve problems they couldn't solve before.

TL;DR

  • Anthropic found that infrastructure configuration alone creates a 6-point spread on Terminal-Bench 2.0 scores — larger than most leaderboard gaps between frontier models
  • Resource limits below 3x the benchmark specs cause spurious container kills; above 3x they actively help agents solve problems they couldn't before
  • Agentic coding evals conflate model capability with infrastructure behavior — a 2-point leaderboard lead might just mean bigger VMs
  • Developers evaluating AI coding tools should treat score differences under 3 percentage points as noise unless infrastructure configs are documented and matched

The Big Picture

SWE-bench and Terminal-Bench leaderboards are treated like precision instruments. A 2-point gap between models drives deployment decisions. Labs compete for tenths of a percentage point. But Anthropic's engineering team just published data showing that infrastructure configuration — CPU allocation, memory limits, container runtime settings — produces score swings that dwarf those margins.

The gap between strict resource enforcement and uncapped allocation on Terminal-Bench 2.0 was 6 percentage points. That's not measurement error. That's the difference between first and fifth place on most leaderboards.

Static benchmarks like MMLU or HumanEval score model outputs directly. The runtime environment is irrelevant. Agentic coding evals are fundamentally different: models write code, run tests, install dependencies, and iterate across multiple turns inside a live container. The infrastructure isn't a passive test harness — it's part of the problem space. Two agents with different resource budgets aren't taking the same test.

This matters because benchmark scores increasingly inform real decisions. Which model to deploy. Which vendor to pay. Which capabilities to trust in production. If a 6-point spread comes from VM specs rather than model intelligence, those decisions are built on sand.

How It Works

Anthropic runs Terminal-Bench 2.0 on Google Kubernetes Engine. During calibration, they noticed their scores didn't match the official leaderboard. Infrastructure error rates were high: roughly 6% of tasks failed due to pod errors unrelated to the model's ability to solve problems.

The discrepancy came down to enforcement methodology. Anthropic's Kubernetes setup applied each task's resource spec as both the guaranteed allocation and the hard kill threshold. With the two set to the same value, there's zero headroom for transient spikes: a momentary memory fluctuation kills the container even if the agent was on track to succeed. Terminal-Bench's leaderboard uses a different sandboxing provider that allows temporary overallocation without terminating containers.

To quantify the effect, Anthropic ran Terminal-Bench 2.0 across six resource configurations: strict enforcement of per-task specs (1x), 1.5x, 2x, 3x, 5x, and uncapped. Same Claude model, same harness, same task set. Only the resource limits changed.

Success rates increased monotonically with resource headroom. Infrastructure error rates dropped from 5.8% at strict enforcement to 0.5% uncapped. The drop from 1x to 3x (5.8% to 2.1%) was statistically significant at p < 0.001. More headroom means fewer spurious container kills.

From 1x through 3x, success scores fluctuated within noise margins (p = 0.40). Most tasks that crashed at 1x would have failed anyway: the agent explored, hit a resource wall, and was killed, but was never on a correct solution path.

Above 3x, the pattern changed. Between 3x and uncapped, infrastructure errors dropped 1.6 percentage points while success jumped almost 4 percentage points. The extra resources enabled agents to try approaches that only work with generous allocations: pulling large dependencies, spawning expensive subprocesses, running memory-intensive test suites. Tasks like rstan-to-pystan and compile-compcert showed significant success rate improvements with memory headroom.

The total lift from 1x to uncapped was 6 percentage points (p < 0.01). Container runtimes enforce resources via two parameters: a guaranteed allocation and a hard limit. When these are equal, transient spikes cause OOM kills. When separated, containers get breathing room without removing meaningful resource pressure.

What This Changes For Developers

Up to roughly 3x Terminal-Bench specs, additional resources fix infrastructure reliability problems — transient resource spikes that kill containers prematurely. The eval gets more stable without getting easier. This is what the Terminal-Bench maintainers' sandboxing provider does implicitly.

Above 3x, additional resources actively help agents solve problems they couldn't before. Limits change what the eval measures. Tight limits reward efficient strategies. Generous limits reward agents that exploit all available resources. Both are legitimate things to test, but collapsing them into a single score without specifying the resource configuration makes it hard to judge how the result generalizes to real-world use.

On bn-fit-modify, a Terminal-Bench task requiring Bayesian network fitting, some models immediately install the full Python data science stack: pandas, networkx, scikit-learn, and dependencies. Under generous limits, this works. Under tight ones, the pod runs out of memory during installation before the agent writes solution code. A leaner strategy exists — implementing the math from scratch using only the standard library — and some models default to it. Others don't. Resource configuration determines which approaches succeed.

Anthropic replicated the finding across different Claude models. The direction was consistent; magnitude varied. The same trends appear to hold on non-Claude models, though rigorous testing is pending. They also ran a crossover experiment on SWE-bench, varying total RAM up to 5x baseline across 227 problems with 10 samples each. Scores increased monotonically with RAM, though the magnitude was smaller: 1.54 percentage points higher at 5x than 1x. SWE-bench tasks are less resource-intensive, so a smaller effect is expected, but resource allocation isn't neutral there either.

Resource allocation isn't the only hidden variable. Time limits matter. Cluster health matters. Hardware specs, concurrency levels, even egress bandwidth can influence scores. Anthropic observed anecdotally that pass rates fluctuate with time of day, likely because API latency varies with traffic patterns. They haven't formally quantified this, but it illustrates the core problem: the boundary between model capability and infrastructure behavior is blurrier than a single benchmark score suggests.

This is particularly relevant for developers building managed agent systems where infrastructure decisions directly impact production performance, not just eval scores.

Try It Yourself

If you're running agentic coding evals, here's how to calibrate resource limits properly. Container runtimes enforce resources via two parameters: a guaranteed allocation (the floor) and a hard kill threshold (the ceiling). Specify both, not a single pinned value.

A single exact spec sets guaranteed allocation equal to kill threshold, leaving zero margin for transient spikes. Separating the two gives containers breathing room to avoid spurious OOM kills while still enforcing a hard ceiling that prevents score inflation.
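The floor/ceiling split can be sketched as a small helper. This is a minimal illustration, not Anthropic's harness: the function name `resource_spec` and its parameters are hypothetical, and the output assumes a Kubernetes-style `resources` stanza, where `requests` is the guaranteed allocation and `limits` is the enforced ceiling.

```python
def resource_spec(cpu_cores: float, memory_mib: int,
                  ceiling_multiplier: float = 3.0) -> dict:
    """Build a Kubernetes-style `resources` stanza with a separate floor and ceiling.

    Hypothetical helper: the floor is the benchmark's per-task spec and the
    ceiling is that spec scaled by `ceiling_multiplier`.
    """
    if ceiling_multiplier < 1.0:
        raise ValueError("ceiling_multiplier must be >= 1.0")
    return {
        # Guaranteed allocation (floor): what the scheduler reserves.
        "requests": {
            "cpu": f"{round(cpu_cores * 1000)}m",
            "memory": f"{memory_mib}Mi",
        },
        # Hard ceiling: exceeding the memory limit gets the container
        # OOM-killed; exceeding the CPU limit gets it throttled.
        "limits": {
            "cpu": f"{round(cpu_cores * ceiling_multiplier * 1000)}m",
            "memory": f"{round(memory_mib * ceiling_multiplier)}Mi",
        },
    }


# Example with made-up per-task specs: 1 core and 2048 MiB, 3x ceiling.
spec = resource_spec(cpu_cores=1.0, memory_mib=2048)
```

Passing `ceiling_multiplier=1.0` reproduces the pinned configuration described above, where floor and ceiling coincide and any transient spike is fatal.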

The band between floor and ceiling should be calibrated so scores at both endpoints fall within noise of each other. For Terminal-Bench 2.0, a 3x ceiling over per-task specs cut infrastructure error rates by roughly two-thirds (5.8% to 2.1%, p < 0.001) while keeping the score lift modest and within noise (p = 0.40). That's a reasonable tradeoff: the infrastructure confounder is largely neutralized without removing meaningful resource pressure.
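One way to check whether the two endpoints are "within noise" is a pooled two-proportion z-test on pass counts. The sketch below is an assumption about how such a check could be done, not the test Anthropic reports using; the function names and the example counts are hypothetical. It also treats every trial as independent, which overstates confidence when the same task is sampled repeatedly.

```python
import math


def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both runs passed everything or nothing
    z = (pass_a / n_a - pass_b / n_b) / se
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def within_noise(floor_run: tuple, ceiling_run: tuple, alpha: float = 0.05) -> bool:
    """True if the pass rates at the floor and the ceiling are not
    distinguishable at significance level `alpha`.

    Each run is a (passes, trials) pair.
    """
    return two_proportion_p_value(*floor_run, *ceiling_run) >= alpha


# Hypothetical counts, for illustration only.
small_gap = within_noise((400, 1000), (415, 1000))   # 1.5-point gap
large_gap = within_noise((400, 1000), (460, 1000))   # 6-point gap
```

With these made-up counts, the 1.5-point gap is not distinguishable from noise while the 6-point gap is. A ceiling would be acceptable under this check only in the first case.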

The exact multiplier varies by benchmark and task distribution. Document it. Report it. The empirical calibration principle is general, but the numbers are not.

The Bottom Line

Use agentic coding benchmark scores as directional signals, not precision measurements. Treat leaderboard differences below 3 percentage points as noise unless infrastructure configurations are documented and matched. The observed spread across moderate resource configurations in Terminal-Bench is just below 2 percentage points. Naive binomial confidence intervals already span 1-2 percentage points — infrastructure confounders stack on top of that, not within it. At allocation extremes, the spread reaches 6 points.
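To see where a one-to-two-point sampling interval comes from, here is a minimal sketch using the normal-approximation (Wald) interval for a pass rate. Reading "naive" as the Wald interval is an assumption, and the pass rate and trial count in the example are hypothetical, not figures from Anthropic's experiments.

```python
import math


def binomial_ci_half_width(pass_rate: float, n_trials: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a pass rate.

    z = 1.96 corresponds to a 95% interval. Trials are assumed independent.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_trials)


# Hypothetical run: 50% pass rate over 5000 independent trials.
half_width = binomial_ci_half_width(0.5, 5000)
half_width_points = 100 * half_width  # about 1.4 percentage points
```

Sampling noise of this size exists even on identical infrastructure. A leaderboard gap has to clear both the sampling interval and the infrastructure spread before it says anything about the models.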

Skip benchmarks that don't publish resource specs and enforcement methodology. If you're running your own evals, treat resource configuration as a first-class experimental variable — document and control it with the same rigor as prompt format or sampling temperature. For production deployments, this matters even more: sandboxing and resource limits directly determine what your agents can and can't do.

The real risk here is that benchmark-driven decisions are built on uncontrolled variables. A few-point lead might signal a genuine capability gap — or it might just be a bigger VM. Until resource methodology is standardized, you can't tell the difference from a leaderboard alone.

Source: Anthropic