Claude 3.5 Sonnet Hits 49% on SWE-Bench: How Anthropic Built It

Claude 3.5 Sonnet hit 49% on SWE-bench Verified with a minimal agent scaffold: two tools, one prompt, maximum model control. Here's the exact architecture Anthropic used and why tool design matters as much as model capability.
TL;DR

  • Claude 3.5 Sonnet (upgraded) scored 49% on SWE-bench Verified, beating the previous SOTA of 45%
  • Anthropic's agent uses just two tools: a Bash executor and a string-replacement file editor
  • The scaffold is intentionally minimal—most control stays with the model, not hardcoded workflows
  • Tool design matters as much as model capability: Anthropic spent significant effort making tools "error-proof" and self-documenting
  • If you're building coding agents with Claude, this is your blueprint

The Big Picture

SWE-bench Verified is the closest thing we have to a real-world test of AI coding ability. It's not LeetCode. It's not interview questions. It's 500 actual GitHub issues from popular Python repos, graded against the unit tests from the original pull requests that fixed them.

Anthropic's upgraded Claude 3.5 Sonnet just hit 49% on this benchmark. That's a 4-point jump over the previous state-of-the-art and a 16-point improvement over the original Claude 3.5 Sonnet. More importantly, Anthropic published the exact agent scaffold they used to achieve it—the prompt, the tools, the design decisions.

This isn't a press release. It's a technical deep-dive into how they built an agent that gives maximum control to the model while keeping the scaffolding dead simple. Two tools. One prompt. No hardcoded workflows.

The takeaway: if you're building coding agents, tool design is as critical as model selection. Anthropic spent serious effort making their tools "error-proof"—absolute paths instead of relative, string replacement instead of line-based edits, detailed descriptions that preempt common mistakes. That work paid off.

How It Works

SWE-bench doesn't test models in isolation. It tests agents—the combination of a model and the software scaffolding around it. The scaffolding generates prompts, parses model output, manages the interaction loop, and decides when to stop.

Anthropic's design philosophy: give the model as much control as possible. The agent has a prompt, a Bash tool for executing commands, and an Edit tool for viewing and modifying files. That's it. The model decides how to pursue the problem. No rigid state machines. No forced step transitions.

The prompt is short. It outlines a suggested approach—explore the repo, reproduce the error, edit the source, rerun the script, think about edge cases—but the model is free to deviate. If you're not token-sensitive, the prompt explicitly encourages long responses. The model continues sampling until it decides it's finished or hits the 200k context limit.
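The loop described above can be sketched in a few lines. This is an illustrative stand-in, not Anthropic's actual agent code: the `model` callable, the reply fields, and the token accounting are all assumptions made for the sketch.

```python
# Minimal sketch of the agent loop: send the transcript to the model, execute
# any tool call it makes, append the result, and repeat until the model stops
# calling tools or the context budget runs out. The model/tool interfaces here
# are hypothetical, not Anthropic's real API.

def run_agent(model, tools, prompt, max_context_tokens=200_000):
    transcript = [{"role": "user", "content": prompt}]
    used_tokens = 0
    while used_tokens < max_context_tokens:
        reply = model(transcript)             # the model decides what to do next
        transcript.append({"role": "assistant", "content": reply["text"]})
        used_tokens += reply["tokens"]
        if reply.get("tool_call") is None:    # model decided it's finished
            break
        name, args = reply["tool_call"]
        result = tools[name](**args)          # e.g. the Bash or Edit tool
        transcript.append({"role": "user", "content": result})
    return transcript
```

Note what is absent: no state machine, no forced phases. The only control flow is "run the tool the model asked for, show it the result."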

The Bash tool is straightforward: one parameter (the command to run), but the description does the heavy lifting. It tells the model about escaping inputs, the lack of internet access, persistent state, how to inspect specific line ranges with sed, and how to run long-lived commands in the background. These details matter. Anthropic tested the tools across a wide variety of agentic tasks, uncovered ways the model might misunderstand the spec, then edited the descriptions to preempt those problems.
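A tool definition in this style might look like the following. The JSON-schema shape (name, description, input_schema) is the common format for model tool specs; the description text below paraphrases the points listed above and is not Anthropic's verbatim spec.

```python
# Sketch of a single-parameter Bash tool definition. The description carries
# the usage details the model needs; the schema itself stays minimal.
bash_tool = {
    "name": "bash",
    "description": (
        "Run commands in a bash shell. "
        "State is persistent across command calls. "
        "You do not have access to the internet. "
        "To inspect a specific line range of a file, e.g. lines 10-25, "
        "use 'sed -n 10,25p /path/to/file'. "
        "For commands that run indefinitely, run them in the background and "
        "redirect output, e.g. 'python server.py > server.log 2>&1 &'."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "The bash command to run."},
        },
        "required": ["command"],
    },
}
```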

The Edit tool is more complex. It handles viewing, creating, and editing files with five commands: view, create, str_replace, insert, and undo_edit. The key design choice: string replacement for edits. The model specifies old_str to replace with new_str in a given file. The replacement only occurs if there's exactly one match. If there are zero or multiple matches, the model gets an error message and retries.

Why string replacement instead of line-based edits? Reliability. Anthropic experimented with several strategies and found this one had the highest success rate. They also made the tool require absolute paths to prevent the model from messing up relative paths after moving out of the root directory. Small decisions like this compound.

The tool descriptions are verbose—intentionally. They include edge cases, usage notes, and warnings about common pitfalls. Anthropic treats tool interface design for models the same way you'd treat UI design for humans: with serious attention to detail. If you're building agents, read their guide on writing effective tools—it's the best resource on this topic.

What This Changes For Developers

The upgraded Claude 3.5 Sonnet shows improved self-correction. It tries multiple solutions instead of getting stuck in loops. It's tenacious—some successful runs took hundreds of turns and over 100k tokens. That's expensive, but it works.

The benchmark results tell the story:

  • Claude 3.5 Sonnet (upgraded): 49%
  • Previous SOTA: 45%
  • Claude 3.5 Sonnet (original): 33%
  • Claude 3 Opus: 22%

All scores use the same agent scaffold. The model improvement is real.

For developers building coding agents, this changes the calculus. You don't need complex scaffolding. You don't need to hardcode workflows. You need well-designed tools and a model that can reason through multi-step problems.

The typical agent behavior: the model views the repo structure, creates a reproduction script, runs it to confirm the bug, edits the source code, reruns the script to verify the fix, and submits. In one example from Anthropic's logs, the model fixed a RidgeClassifierCV parameter issue in 12 steps. It identified that the class wasn't passing store_cv_values to its parent constructor, added the parameter to the __init__ signature, and passed it through to super(). Clean, minimal, correct.

But not all tasks are that clean. Some took over 100 turns. Some failed because the model solved the problem at the wrong level of abstraction, applying a band-aid fix where a deeper refactor was needed. Some failed because the model couldn't see the hidden unit tests it was being graded against and "thought" it had succeeded when it hadn't.

Try It Yourself

Anthropic used the SWE-Agent framework as a foundation for their agent code. The full prompt and tool specs are in the source article. Here's the core prompt structure:

<uploaded_files>{location}</uploaded_files>
I've uploaded a python code repository in the directory {location}. 
Consider the following PR description:

<pr_description>{pr_description}</pr_description>

Can you help me implement the necessary changes to the repository 
so that the requirements specified in the <pr_description> are met?

Follow these steps to resolve the issue:
1. Explore the repo to familiarize yourself with its structure
2. Create a script to reproduce the error and execute it
3. Edit the sourcecode of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed
5. Think about edgecases and make sure your fix handles them as well

Your thinking should be thorough and so it's fine if it's very long.

The Bash tool schema is minimal—just a command parameter—but the description includes critical details about escaping, internet access, persistent state, and background processes. The Edit tool uses string replacement with old_str and new_str parameters, requiring absolute paths and exact matches.
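Putting the pieces together, an input schema for the Edit tool's five commands might look like this. The field names (command, path, old_str, new_str) and the command list mirror those described in the article; the rest of the schema layout is an assumption for illustration, not Anthropic's published spec.

```python
# Sketch of an Edit tool spec covering the five commands. Only command and
# path are always required; the other fields apply to specific commands.
edit_tool = {
    "name": "str_replace_editor",
    "description": "View, create, and edit files. Paths must be absolute.",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "enum": ["view", "create", "str_replace", "insert", "undo_edit"],
            },
            "path": {"type": "string", "description": "Absolute path to the file."},
            "old_str": {"type": "string", "description": "Exact text to replace (str_replace only)."},
            "new_str": {"type": "string", "description": "Replacement or inserted text."},
            "insert_line": {"type": "integer", "description": "Line after which to insert (insert only)."},
            "file_text": {"type": "string", "description": "Content for a new file (create only)."},
        },
        "required": ["command", "path"],
    },
}
```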

If you're building similar agents, start with these tool designs. Test them across diverse tasks. Watch for ways the model misunderstands the spec. Iterate on the descriptions. The model's performance depends on it.

The Bottom Line

Use Claude 3.5 Sonnet (upgraded) if you're building coding agents that need to handle real-world software engineering tasks. The 49% SWE-bench score isn't just a benchmark win—it's proof that the model can self-correct, try multiple approaches, and work through complex multi-step problems.

Skip the complex scaffolding. Anthropic's minimal approach—two tools, one prompt, maximum model control—outperformed systems with rigid workflows. Invest your time in tool design instead. Make your tools error-proof. Write detailed descriptions. Test them thoroughly.

The real opportunity here is for developers to push beyond 49%. Anthropic didn't implement multimodal file viewing, which hurt performance on Matplotlib tasks. They didn't optimize for token efficiency—some runs burned through 100k+ tokens. There's low-hanging fruit.

The real risk is underestimating tool design. A poorly specified tool will tank your agent's performance no matter how good the underlying model is. Anthropic spent significant effort on this. You should too.

Source: Anthropic