GitHub Copilot CLI's Rubber Duck Uses Cross-Model Review
GitHub Copilot CLI's new Rubber Duck feature uses a second AI model family to review agent plans and code. Claude Sonnet + GPT-5.4 closes 74.7% of the performance gap to Opus on difficult multi-file tasks.
TL;DR
- GitHub Copilot CLI now uses a second AI model family to review the primary agent's work before execution
- Claude Sonnet + GPT-5.4 Rubber Duck closes 74.7% of the performance gap to Claude Opus alone on difficult multi-file tasks
- Rubber Duck activates automatically at key checkpoints: after planning, after complex implementations, and before test execution
- Available now in experimental mode via the `/experimental` command in Copilot CLI
The Big Picture
Coding agents are getting better at writing code, but they're still making the same category of mistake: confident early decisions that compound into bigger problems downstream. You ask an agent to build a data pipeline, it picks a structure in the planning phase, and by the time you realize the architecture is wrong, you've got three files that depend on it.
GitHub's answer is Rubber Duck, a new experimental feature in Copilot CLI that uses a second model from a different AI family to review the primary agent's work. When you're running Claude as your orchestrator, Rubber Duck spins up GPT-5.4 to act as an independent reviewer. The hypothesis: models trained on different data with different techniques will catch different classes of errors.
This isn't just self-reflection with extra steps. Self-reflection has the agent review its own output, but it's still bounded by the same training biases. Rubber Duck brings in a genuinely different perspective, and GitHub's benchmarks suggest it works. On SWE-Bench Pro, Claude Sonnet paired with Rubber Duck nearly matches Claude Opus running solo, closing three-quarters of the performance gap.
The feature is live now in experimental mode. It's a meaningful shift in how coding agents validate their work, and it's worth understanding how it actually operates before you enable it.
How It Works
Rubber Duck is a focused review agent. It doesn't rewrite code or take over the session. Its job is to read the primary agent's plan or implementation and surface a short list of high-value concerns: missed details, questionable assumptions, edge cases that weren't considered.
The model pairing is deliberate. When you select a Claude model (Opus, Sonnet, or Haiku) as your orchestrator in the model picker, Rubber Duck runs GPT-5.4. GitHub is exploring other combinations, but the current setup pairs Anthropic's training approach with OpenAI's. The idea is that different model families have different blind spots, so a cross-family review catches more than a same-family review would.
Rubber Duck activates at three automatic checkpoints, chosen because they're the moments where catching an error has the highest return:
- After drafting a plan: This is the big one. If the agent picks a suboptimal architecture or makes a flawed assumption in the planning phase, everything downstream inherits that flaw. Catching it here avoids compounding errors.
- After a complex implementation: When the agent writes a large block of code, Rubber Duck reviews it for edge cases, logic errors, and cross-file conflicts.
- After writing tests, before executing them: This catches gaps in test coverage or flawed assertions before the agent runs the tests and concludes everything is fine.
Rubber Duck can also activate reactively if the agent gets stuck in a loop or can't make progress. And you can invoke it manually at any time by asking Copilot to critique its work.
GitHub made a key design choice: Rubber Duck is invoked sparingly. It doesn't review every single step. It targets the checkpoints where the signal-to-noise ratio is highest. Under the hood, Rubber Duck is invoked through Copilot's existing task tool infrastructure, the same system used for other subagents.
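The checkpoint-plus-sparse-invocation pattern is simple to sketch. The following is a hypothetical illustration of the idea, not GitHub's implementation; the `Checkpoint` names, `HIGH_VALUE` set, and `maybe_review` function are invented for the example, and `reviewer` stands in for a call to a second model family.

```python
from enum import Enum, auto

class Checkpoint(Enum):
    """Hypothetical labels for the moments a review could trigger."""
    AFTER_PLAN = auto()
    AFTER_COMPLEX_IMPL = auto()
    BEFORE_TEST_RUN = auto()
    MINOR_STEP = auto()  # ordinary steps are not reviewed

# Only the three high-return checkpoints trigger a review by default.
HIGH_VALUE = {Checkpoint.AFTER_PLAN,
              Checkpoint.AFTER_COMPLEX_IMPL,
              Checkpoint.BEFORE_TEST_RUN}

def maybe_review(checkpoint, artifact, reviewer, stuck=False):
    """Invoke the second-model reviewer sparingly.

    `reviewer` is any callable returning a list of concern strings;
    only the top few are surfaced to keep signal-to-noise high.
    """
    if checkpoint not in HIGH_VALUE and not stuck:
        return []  # skip ordinary steps entirely
    return reviewer(artifact)[:3]

# A stub reviewer flagging one architectural concern.
stub = lambda plan: ["scheduler exits before running any jobs"]
print(maybe_review(Checkpoint.AFTER_PLAN, "draft plan", stub))
print(maybe_review(Checkpoint.MINOR_STEP, "small edit", stub))  # []
```

Note the reactive path: a `stuck=True` call corresponds to the agent looping without progress, which the post says also triggers a review.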
The benchmarks are specific. GitHub tested Rubber Duck on SWE-Bench Pro, a set of difficult, real-world coding problems from open-source repositories. Claude Sonnet 4.6 with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 alone, closing 74.7% of the performance gap between Sonnet and Opus.
The effect is more pronounced on harder problems. On tasks spanning three or more files and requiring 70+ steps, Sonnet + Rubber Duck scores 3.8% higher than Sonnet alone. On the hardest problems across three trials, the improvement jumps to 4.8%.
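To make the "gap closed" metric concrete: the post reports the percentage rather than raw benchmark scores, so the numbers below are invented purely to illustrate the arithmetic.

```python
def gap_closed(base: float, combo: float, target: float) -> float:
    """Fraction of the (target - base) gap recovered by the combo setup."""
    return (combo - base) / (target - base)

# Hypothetical resolution rates, chosen only to show the formula:
# Sonnet alone 40.0, Sonnet + Rubber Duck 47.47, Opus alone 50.0.
print(round(gap_closed(40.0, 47.47, 50.0), 3))  # 0.747, i.e. 74.7%
```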
What This Changes For Developers
Rubber Duck shifts the trust model for coding agents. Right now, when an agent drafts a plan, you either accept it or you don't. If you accept it and the plan is flawed, you're debugging later. If you reject it, you're back to square one. Rubber Duck adds a middle layer: the agent gets a second opinion before you commit.
The real-world examples GitHub shared are telling. In one case, Rubber Duck caught that a proposed async scheduler would start and immediately exit, running zero jobs. Even if that bug were fixed, one of the scheduled tasks was an infinite loop. That's two architectural problems caught before any code was written.
In another case, Rubber Duck caught a one-line bug where a loop silently overwrote the same dictionary key on every iteration. Three of four Solr facet categories were being dropped from every search query, with no error thrown. That's the kind of bug that makes it to production because the code runs without crashing.
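That class of bug is easy to reproduce in miniature. A hypothetical sketch of the pattern (the Solr specifics are invented for illustration):

```python
facet_fields = ["category", "author", "format", "language"]

# Bug: every iteration writes the same dictionary key, so each
# facet field silently overwrites the last -- no error is raised.
params_buggy = {}
for field in facet_fields:
    params_buggy["facet.field"] = field

# Fix: repeated parameters belong in a list under one key.
params_fixed = {"facet.field": list(facet_fields)}

print(params_buggy["facet.field"])  # language -- three of four dropped
print(params_fixed["facet.field"])  # all four facet fields survive
```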
A third example: Rubber Duck caught a cross-file conflict where three files all read from a Redis key that the new code stopped writing. The confirmation UI and cleanup paths would have been silently broken on deploy. That's a multi-file dependency issue that's hard to catch without reading all the affected files at once.
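Dependency breaks like this are invisible when you read files one at a time. A hypothetical sketch of the shape of the bug, using a plain dict as a stand-in for Redis and invented key names:

```python
# Stand-in for a Redis instance so the sketch is self-contained.
store = {}

def refactored_writer(order_id):
    # The refactor removed the write below; readers in other
    # files still expect the key to exist.
    # store[f"pending:{order_id}"] = "awaiting-confirmation"
    pass

def confirmation_ui(order_id):
    # Lives in a different file; silently renders nothing
    # when the key is missing -- no exception is raised.
    return store.get(f"pending:{order_id}")

refactored_writer(42)
print(confirmation_ui(42))  # None -- the UI is quietly broken
```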
The workflow impact is straightforward. If you're using Copilot CLI for complex refactors, architectural changes, or high-stakes tasks, Rubber Duck gives you a second set of eyes at the moments where mistakes are most expensive. You'll see critiques surface automatically when the agent hits a checkpoint, or you can request one manually.
This is particularly useful if you're working on unfamiliar codebases or making changes that touch multiple systems. The agent might not know all the implicit dependencies, but a second model with a different training background might catch what the first one missed.
For context, GitHub has been iterating on Copilot's capabilities across multiple surfaces. The Copilot SDK enables custom AI workflows, and AI-powered security detections are expanding language coverage. Rubber Duck is another step in making the agent more reliable by default.
Try It Yourself
Rubber Duck is available now in experimental mode. Here's how to enable it:
```shell
# Install GitHub Copilot CLI if you haven't already
gh extension install github/gh-copilot

# Enable experimental features
gh copilot /experimental
```
Once you're in experimental mode, select any Claude model (Opus, Sonnet, or Haiku) from the model picker. Rubber Duck will activate automatically at the checkpoints described above, assuming you have access to GPT-5.4.
To manually request a critique at any point, just ask Copilot to review its work. The agent will invoke Rubber Duck, incorporate the feedback, and show you what changed and why.
Where Rubber Duck helps most:
- Complex refactors and architectural changes where early mistakes compound
- High-stakes tasks where a miss is costly (production systems, security-sensitive code)
- Ensuring comprehensive test coverage before you run the tests
- Any time you want a second opinion on a plan before committing to it
The Bottom Line
Use Rubber Duck if you're working on multi-file changes, unfamiliar codebases, or anything where an early architectural mistake will cost you hours of debugging later. The benchmarks show the biggest gains on difficult, long-running tasks, so if you're using Copilot CLI for simple one-file edits, the overhead probably isn't worth it.
Skip it if you're doing straightforward, low-risk work where you'd catch mistakes quickly anyway. The feature is in experimental mode for a reason: GitHub is still tuning when and how often Rubber Duck activates.
The real opportunity here is that cross-model review might become table stakes for coding agents. If one model family consistently misses certain classes of errors and another catches them, combining them is the obvious move. The risk is that this adds latency and cost to every session, so GitHub needs to prove that the checkpoints are well-chosen and the signal-to-noise ratio stays high. Early results suggest they're on the right track, but this is one to watch as it moves out of experimental.
Source: GitHub Blog