How Cline Climbed from 47% to 57% on Terminal Bench in One Weekend
Cline jumped from 47% to 57% on Terminal Bench in one weekend using hill climbing: run evals, diagnose failures, fix one thing, measure again. Here's the exact process, including the Harbor + Modal setup that makes it practical.
TL;DR
- Cline improved from 47% to 57% on Terminal Bench by systematically diagnosing failures and shipping targeted fixes
- Hill climbing is an iterative process: run evals, analyze failures, fix one thing, measure again, keep what works
- Harbor + Modal lets you run 89 coding tasks in parallel in under an hour instead of sequentially over many hours
- This process works for any AI coding agent — Cursor, Claude Code, OpenHands, or your own custom setup
The Big Picture
A potential partner asked Cline for benchmark numbers. The team looked at third-party results and found themselves behind Cursor, Claude Code, and other agents. They had no systematic way to measure performance or diagnose what was breaking.
Over one weekend, three engineers ran Cline against Terminal Bench's 89 real-world coding tasks, diagnosed every failure, and shipped fixes. The score jumped from 47% to 57%, putting Cline ahead of Claude Code, OpenHands, and OpenCode.
This wasn't magic. It was hill climbing: run the agent, measure the score, change one thing, run again. Keep changes that improve the score. Revert changes that don't. Repeat until you stop climbing.
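In code, the loop is only a few lines. Here's a minimal Python sketch; `fake_eval` is a made-up deterministic evaluator standing in for a real Harbor sweep, and the config keys are illustrative, not Cline's actual settings:

```python
def hill_climb(base_config, candidate_changes, run_eval):
    """Apply one change at a time; keep it only if it beats the best score."""
    best_config = dict(base_config)
    best_score = run_eval(best_config)
    for change in candidate_changes:
        trial = {**best_config, **change}   # change one thing
        score = run_eval(trial)             # run again
        if score > best_score:              # keep what works
            best_config, best_score = trial, score
        # otherwise revert: best_config stays unchanged
    return best_config, best_score

# Deterministic fake evaluator for illustration only: pretend longer
# timeouts and file verification each buy a few points.
def fake_eval(cfg):
    score = 0.47
    if cfg.get("timeout", 600) >= 2400:
        score += 0.05
    if cfg.get("verify_files"):
        score += 0.05
    return score

config, score = hill_climb(
    {"timeout": 600},
    [{"timeout": 2400}, {"verify_files": True}],
    fake_eval,
)
print(config, round(score, 2))
```

In practice `run_eval` would shell out to `harbor run` and parse the results; the point is the structure: one change per iteration, measured against the current best, reverted if it doesn't help.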
Most coding agent evaluations are either single-turn or saturated to the point of uselessness. Terminal Bench tests the entire agentic flow — file manipulation, command execution, error recovery, verification. It grades the full sequence of steps an agent performs, not just whether it can complete a single API call.
The process outlined here works for any model or agent combo. If you're building on Claude Code, Cursor, Gemini CLI, or your own custom agent, you can use the same framework to systematically improve performance.
How It Works
Hill climbing requires three components: a benchmark dataset, an evaluation harness, and a way to run tasks in parallel.
The benchmark: Terminal Bench 2.0 contains 89 diverse coding tasks. Each task runs in an isolated sandbox with a verifier that checks whether the agent succeeded. Tasks range from simple file operations to multi-step debugging and deployment workflows.
The harness: Harbor is an agent evaluation framework built by the creators of Terminal Bench. It abstracts sandbox management, agent loops, and rollout monitoring. A Harbor task is just a directory. Harbor spins up the sandbox, runs the agent, verifies the result, and tears everything down. It supports multiple datasets, so you can swap benchmarks depending on what you're optimizing for.
The infrastructure: Running 89 tasks sequentially on your local machine takes hours. Harbor integrates with Modal to parallelize tasks across cloud containers. With Modal, a full eval run completes in 35-45 minutes instead of half a day. This speed is critical — you can't iterate quickly if each experiment takes six hours.
The setup requires Python, Docker, and uv. You'll need API keys for at least one LLM provider. OpenRouter is recommended because it's been the most reliable for evals — other providers hit rate limits or have infrastructure issues during heavy testing.
Once the infrastructure is running, the hill climbing loop is straightforward. Establish a baseline by running a full 89-task sweep with your current config. Record the score. Analyze failures using Harbor's summarize command, which categorizes why tasks failed. Common failure patterns for Cline included timeout errors, missing file verification, and command exit codes not being surfaced to the agent.
Each failure pattern becomes a hypothesis with a corresponding fix:
- Cline's default 600-second timeout was too short for long-running build tasks. The fix: increase the timeout to 2400 seconds.
- Cline assumed success without verifying that expected files existed. The fix: require verification before marking tasks complete.
- Command exit codes weren't surfaced, so Cline didn't know when commands failed. The fix: surface exit codes in the agent loop.
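Cline itself is a TypeScript codebase, but the exit-code fix is easy to illustrate in any language. A minimal Python sketch, where `run_command` is a hypothetical agent-loop helper (not Cline's actual code): instead of feeding only stdout back to the model, include the exit status in the observation so the model can tell a step failed.

```python
import subprocess

def run_command(cmd, timeout=2400):
    """Run a shell command and surface the exit code to the agent.

    A nonzero exit code in the observation tells the model the step
    failed; returning stdout alone hides that signal.
    """
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    observation = (
        f"exit_code: {result.returncode}\n"
        f"stdout: {result.stdout}"
        f"stderr: {result.stderr}"
    )
    return result.returncode, observation

# A failing command now produces an observation the model can act on.
code, obs = run_command("ls /nonexistent-path")
assert code != 0 and obs.startswith("exit_code:")
```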
After identifying a fix, you A/B test it. Run the baseline config and the modified config side-by-side. Compare scores. If the change improves performance, merge it. If it doesn't, revert and try something else.
Single runs can be noisy. Scores vary by several percentage points between runs due to model non-determinism and infrastructure variance. When results are close, run the same config 3-6 times and average the scores. Cline's team ran Opus 4.5 with thinking tokens enabled six times and got scores of 0.49, 0.43, 0.45, 0.44, 0.48, and 0.46. The average was 0.458, the median 0.455. This gives you a reliable signal when comparing configs that score within a few points of each other.
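The averaging is ordinary mean and median over repeated runs. Reproducing the numbers above in Python:

```python
from statistics import mean, median

# Scores from six runs of the same config (Opus 4.5, thinking enabled)
scores = [0.49, 0.43, 0.45, 0.44, 0.48, 0.46]

print(round(mean(scores), 3))    # average across runs
print(round(median(scores), 3))  # robust to a single outlier run
```

This prints 0.458 and 0.455, matching the figures above. The median is worth tracking alongside the mean because one anomalous run (an infrastructure hiccup, a rate-limited provider) can drag the average.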
Harbor supports testing different branches or commits without merging code. You can pass --ak github_user=cline and --ak commit_hash=main to test the main branch, or --ak commit_hash=saoud/fix-exit-codes to test a PR branch. This lets you run multiple experiments in parallel using nohup and compare results without polluting your main branch.
What This Changes For Developers
Before this process, Cline had no systematic way to measure whether a code change improved agent performance. Changes were shipped based on intuition or anecdotal evidence. Now, every change can be validated against a standardized benchmark before it ships.
This matters because AI coding agents fail in non-obvious ways. A prompt tweak that improves performance on simple file operations might break multi-step debugging tasks. A timeout increase that fixes build tasks might cause the agent to waste time on unsolvable problems. Without evals, you're flying blind.
The hill climbing process also surfaces failure modes you wouldn't discover through manual testing. Cline's team found that the agent wasn't verifying file creation before marking tasks complete. This bug only appeared in specific task types and would have been nearly impossible to catch without running the full benchmark.
For teams building on top of AI coding agents, this framework provides a way to validate customizations. If you're orchestrating multiple agents or tweaking prompts for domain-specific tasks, you can measure whether your changes actually improve performance or just feel better.
The infrastructure cost is non-trivial. Running 89 tasks in parallel on Modal isn't free. But if you're building a product on top of an AI coding agent, the cost of not knowing whether your changes work is higher.
Try It Yourself
Install Harbor and set up your environment:
```bash
pip install harbor
# or
uv tool install harbor

# Set up Modal for parallel runs
pip install modal
modal setup
```
Configure your environment variables:
```bash
export CPUS=14
export MEMORY_MB=8192
export OPENROUTER_API_KEY="sk-or-v1-yourkey"
export API_KEY=$OPENROUTER_API_KEY
```
Run a quick test to verify your setup:
```bash
source ~/.env
harbor run \
  -d terminal-bench@2.0 \
  -a cline-cli \
  -m openrouter:anthropic/claude-opus-4.5 \
  --env modal \
  -n 3 \
  -l 3
```
This runs 3 tasks in parallel. Each task takes about 15 minutes due to Harbor setup time. If this completes without errors, your setup works.
Run a full baseline eval:
```bash
source ~/.env && export API_KEY=$OPENROUTER_API_KEY
harbor run \
  -d terminal-bench@2.0 \
  -a cline-cli \
  -m openrouter:anthropic/claude-opus-4.6 \
  --env modal \
  --ak thinking=6000 \
  --ak timeout=2400 \
  -n 89 -l 89 \
  --override-cpus $CPUS --override-memory-mb $MEMORY_MB
```
This runs all 89 tasks with a 40-minute timeout per task. The full run takes 40-50 minutes including Modal setup.
Analyze failures:
```bash
harbor jobs summarize ./jobs/LATEST --failed -m haiku
```
This categorizes why tasks failed. Look for patterns: timeouts, missing files, command failures, inference errors. Each pattern is a hypothesis for what to fix next.
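Harbor's exact summarize output isn't reproduced here, but once you have per-task results in any structured form, tallying failure patterns takes a few lines. A sketch with entirely hypothetical data (the real schema from Harbor's job output may differ):

```python
from collections import Counter

# Hypothetical per-task results; in practice these would be parsed
# from Harbor's job output.
results = [
    {"task": "build-large-project", "passed": False, "failure": "timeout"},
    {"task": "create-config", "passed": False, "failure": "missing_file"},
    {"task": "fix-tests", "passed": True, "failure": None},
    {"task": "deploy-service", "passed": False, "failure": "timeout"},
]

failures = Counter(r["failure"] for r in results if not r["passed"])
for pattern, count in failures.most_common():
    print(f"{pattern}: {count}")  # each pattern is a hypothesis to fix
```

Sorting patterns by frequency tells you which hypothesis to test first: the fix that addresses the most failed tasks has the largest potential score impact.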
The Bottom Line
Use this if you're building or customizing an AI coding agent and need to validate that your changes actually work. Use this if you're evaluating multiple agents and want objective data instead of marketing claims. Use this if you're trying to push an agent from "mostly works" to "reliably works."
Skip this if you're just using an AI coding agent as-is and don't care about performance optimization. Skip this if you don't have the infrastructure budget to run parallel evals — the sequential version takes too long to iterate quickly.
The real opportunity here isn't just improving benchmark scores. It's building a feedback loop that lets you diagnose and fix failure modes systematically. Cline went from 47% to 57% in one weekend because they had a process for identifying what was broken and validating fixes. Without that process, they'd still be guessing.
The risk is over-optimizing for a single benchmark. Terminal Bench tests real-world coding tasks, but it's not exhaustive. A change that improves Terminal Bench scores might hurt performance on tasks outside the benchmark. Use this as one signal among many, not as the only metric that matters.
Source: Cline