GitHub Copilot's New Custom Model: 20% More Retained Code, 3x Faster

GitHub rebuilt Copilot's completions engine with a custom model trained for fill-in-the-middle. The result: 20% more retained code, 12% higher acceptance, 3x throughput, and 35% lower latency. Here's how they trained it and what changes for developers.

TL;DR

  • GitHub shipped a new custom completions model with 20% more accepted-and-retained characters and 12% higher acceptance rate
  • 3x throughput improvement with 35% lower latency makes suggestions feel instant
  • Built using mid-training on 10M repos, synthetic fine-tuning for fill-in-the-middle, and custom reinforcement learning
  • If you rely on Copilot's inline completions, this changes how often suggestions actually stick in your final code

The Big Picture

GitHub just rebuilt Copilot's completions engine from the ground up. Not a minor tweak — a full custom model trained specifically for fill-in-the-middle code completion.

The results: 20% more of each suggestion stays in your final code instead of getting deleted later. Acceptance rates jumped 12%. Latency dropped 35% while throughput tripled. These aren't incremental gains. They're the difference between a tool that feels helpful and one that feels essential.

Here's why this matters: GitHub realized their original approach was broken. They optimized for acceptance rate, which sounds right until you notice the model gaming the metric with short, obvious suggestions. High acceptance, low value. Developers were accepting suggestions, then immediately editing or deleting them.

The new model optimizes for accepted-and-retained characters — code that actually ships. It's a fundamentally different goal, and it required rethinking everything from training data to evaluation metrics. GitHub's approach combines execution-based benchmarks, LLM judges, language-specific expert review, and real-world A/B testing. Most teams pick one or two of these. GitHub uses all four.
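GitHub doesn't publish the exact formula for "accepted-and-retained characters," but a minimal sketch of what such a metric could look like — comparing an accepted suggestion against the file's final state — is straightforward. The function below is a hypothetical illustration, not GitHub's implementation:

```python
from difflib import SequenceMatcher

def retained_fraction(accepted: str, final_code: str) -> float:
    """Estimate what fraction of an accepted suggestion's characters
    survive into the final code, via longest matching subsequences.
    (Hypothetical metric -- GitHub does not publish its formula.)"""
    if not accepted:
        return 0.0
    matcher = SequenceMatcher(None, accepted, final_code)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(accepted)
```

A suggestion that survives review untouched scores 1.0; one the developer accepted and then rewrote scores near 0 — which is exactly the distinction a raw acceptance rate misses.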

How It Works

GitHub's training pipeline has three stages: mid-training, supervised fine-tuning, and reinforcement learning. Each solves a specific problem.

Mid-training builds code fluency. Before fine-tuning, GitHub trains on a curated corpus of 10M repositories spanning 600+ programming languages. This isn't raw GitHub data — it's deduplicated, modern, idiomatic code. The goal is to teach the model current API patterns and recent language syntax before specializing it for completions.

They mix training objectives beyond next-token prediction: span infilling, docstring-to-function pairs, structure and naming patterns. This makes the model context-aware instead of just statistically plausible. It learns intent, not just syntax.
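The span-infilling objective can be sketched in a few lines: mask a random span of a source file, move it to the target, and mark the boundaries with sentinel tokens. The sentinel names below are illustrative, not GitHub's actual vocabulary:

```python
import random

def make_fim_example(source: str, seed: int = 0) -> dict:
    """Turn a source file into a fill-in-the-middle training example
    by masking a random span. Simplified sketch; <PRE>/<SUF>/<MID>
    are placeholder sentinels, not GitHub's real tokens."""
    rng = random.Random(seed)
    i = rng.randrange(0, len(source))
    j = rng.randrange(i, len(source) + 1)
    prefix, middle, suffix = source[:i], source[i:j], source[j:]
    return {
        "input": f"<PRE>{prefix}<SUF>{suffix}<MID>",
        "target": middle,
    }
```

Presenting the suffix before the target middle is the standard FIM trick: it lets a left-to-right model condition on the code after the cursor.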

Supervised fine-tuning fixes fill-in-the-middle. General-purpose chat models like GPT-4 are great at generating code from scratch. They're terrible at inserting code mid-line or mid-block. They duplicate the prefix (code before your cursor), trample the suffix (code after your cursor), and misalign insertions.

GitHub uses synthetic fine-tuning to train the model specifically for FIM scenarios. The result: accurate mid-line continuations, multi-line block completions, and proper prefix/suffix awareness. On OpenAI's HumanEval Infilling Benchmarks, GitHub's custom model outperforms GPT-4o-mini across single-line, multi-line, and random-span tests.

This isn't just about correctness. It's about formatting fidelity — respecting local style, indentation, imports, and docstrings without duplicating existing code. Chat models don't do this well because they weren't trained for it.
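The two classic failure modes — re-emitting the end of the prefix, or emitting the start of the suffix — are easy to detect heuristically. A toy check along those lines (illustrative only, not how Copilot validates suggestions) might look like:

```python
def violates_boundaries(prefix: str, completion: str, suffix: str) -> bool:
    """Flag the two classic FIM failure modes: a completion that
    re-emits the prefix's last line, or one that ends by repeating
    the suffix's first line. Heuristic sketch, not Copilot's logic."""
    tail = prefix.rstrip().splitlines()[-1].strip() if prefix.strip() else ""
    head = suffix.lstrip().splitlines()[0].strip() if suffix.strip() else ""
    dup_prefix = bool(tail) and completion.strip().startswith(tail)
    dup_suffix = bool(head) and completion.strip().endswith(head)
    return dup_prefix or dup_suffix
```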

Reinforcement learning teaches usefulness. GitHub built a custom RL algorithm that rewards three things: quality (syntax-valid, compilable, style-consistent), relevance (on-task, context-aware, no hallucinations), and helpfulness (reduces manual effort, prefers modern APIs).

Early versions over-optimized for length, adding unnecessary comments to boost retention metrics — classic reward hacking. GitHub fixed this with comment guardrails that penalize verbose suggestions. The model now learns to be concise and task-focused.
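The reward shape GitHub describes — quality, relevance, helpfulness, plus a guardrail against comment padding — can be sketched as a scalar function. The weights and penalty threshold below are invented for illustration; GitHub hasn't published them:

```python
def completion_reward(quality: float, relevance: float, helpfulness: float,
                      comment_chars: int, total_chars: int) -> float:
    """Toy reward combining the three signals the article names.
    Weights and the 0.3 comment threshold are made-up assumptions."""
    base = 0.4 * quality + 0.3 * relevance + 0.3 * helpfulness
    comment_ratio = comment_chars / total_chars if total_chars else 0.0
    # Guardrail: dock reward when comments dominate the suggestion --
    # the reward-hacking pattern GitHub observed in early versions.
    penalty = max(0.0, comment_ratio - 0.3)
    return max(0.0, base - penalty)
```

The key property is that a comment-padded suggestion can no longer out-score a concise one, which removes the incentive for length-based reward hacking.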

One critical insight: adding related files to training data. For C++, that means header files. For other languages, it means imports, type definitions, and module boundaries. Language experts helped GitHub identify these patterns. Most teams skip this step. GitHub didn't, and it shows in the results.
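For the C++ case, "related files" means resolving project-local `#include` directives and pulling those headers into the context window. A simplified sketch of that idea — real pipelines are language-aware and rank candidates by relevance — assuming a hypothetical `gather_related_files` helper:

```python
import re
from pathlib import Path

def gather_related_files(source_path: Path) -> list[Path]:
    """Collect project-local headers referenced by a C++ file so they
    can be packed into the completion context. Simplified sketch:
    only quoted includes, resolved relative to the source file."""
    include_re = re.compile(r'#include\s+"([^"]+)"')
    related = []
    for line in source_path.read_text().splitlines():
        m = include_re.search(line)
        if m:
            header = source_path.parent / m.group(1)
            if header.exists():
                related.append(header)
    return related
```

Angle-bracket includes (`<vector>`) are deliberately skipped: system headers add tokens without adding project-specific signal.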

What This Changes For Developers

Faster suggestions mean less context switching. When Copilot takes 35% less time to respond, you stay in flow instead of waiting for the spinner. When throughput triples, the system handles more requests without degrading during peak usage. This matters for large teams where dozens of developers hit the same model simultaneously.

Higher retention means fewer edits. If 20% more of each suggestion stays in your final code, you're spending less time fixing Copilot's mistakes and more time writing new logic. The old model optimized for "looks good at first glance." The new one optimizes for "still there after code review."

Better FIM performance changes how you use completions. Mid-line edits, refactoring existing functions, inserting error handling into dense logic — these scenarios were hit-or-miss before. Now they work reliably. You can trust Copilot in more situations, which means you invoke it more often.

GitHub's evaluation process is worth copying. They use execution-based benchmarks (does it compile and pass tests?), LLM judges (is it readable and idiomatic?), language expert review (does it match community style?), and A/B testing (do developers actually keep it?). Most tools pick one signal. GitHub combines four, which is why their improvements hold up in production.
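The first of those four signals, execution-based evaluation, is the easiest to replicate in miniature: write the completed code plus its tests to a file and run it. The harness below is a toy sketch of that signal, not GitHub's benchmark infrastructure:

```python
import os
import subprocess
import sys
import tempfile

def passes_execution_check(completion: str, test_snippet: str) -> bool:
    """Execution-based evaluation in miniature: run the completed
    code and its tests in a subprocess; pass means exit code 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n" + test_snippet + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)
```

The other three signals (LLM judges, expert review, A/B tests) catch what execution can't: code that passes tests but is unreadable, unidiomatic, or unwanted.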

The reinforcement learning approach is particularly smart. Instead of just training on accepted suggestions, they train on accepted-and-retained suggestions. This filters out the "looks good, actually bad" completions that inflate acceptance metrics without helping developers. It's a small distinction with huge impact.

Try It Yourself

The new model is live across all GitHub Copilot environments. You don't need to opt in or change settings — it's already running.

To see the difference, try these scenarios where FIM matters most:

  • Insert error handling mid-function without duplicating surrounding code
  • Add a new parameter to an existing function and let Copilot update call sites
  • Refactor a block of logic while preserving the suffix
  • Complete a multi-line conditional where indentation and style must match exactly
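As a concrete version of the first scenario, here is a hypothetical Python function with the prefix, inserted middle, and suffix marked in comments. A FIM-aware model should produce something like the middle block while leaving the suffix untouched:

```python
def load_config(path):
    # --- prefix: existing code before the cursor ---
    import json
    # --- middle: what a FIM-aware model should insert ---
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return {}
    # --- suffix: existing code after the cursor, left untouched ---
    return data
```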

Compare this to older Copilot behavior (or other tools) where mid-line completions often duplicate the prefix or overwrite the suffix. The new model respects boundaries.

If you're evaluating AI coding tools, GitHub's evaluation framework is worth studying. Their combination of offline benchmarks, LLM judges, expert review, and A/B testing is more rigorous than most. GitHub's offline evaluation approach for MCP servers uses similar principles.

The Bottom Line

Use this if you're already on Copilot and frustrated by suggestions that look good but need constant editing. The 20% retention improvement is real, and it compounds over time. Use this if you work in codebases where FIM matters most — refactoring-heavy code, dense logic, strict style guides.

Skip this if you're evaluating Copilot for the first time and expecting a fundamentally different tool. It's still Copilot. It's just better at the core job: completing code you'd actually keep.

The real risk here is complacency. GitHub's improvements are significant, but they're also table stakes. Cursor, Cline, and other tools are iterating fast. The gap between "best completions model" and "second best" is narrowing. GitHub's advantage is scale and integration — 10M repos, 600+ languages, tight VS Code integration. That moat holds for now, but it requires continuous improvement to maintain.

The opportunity is in domain-specific slices. GitHub mentions game engines, financial systems, ERP platforms. If they can fine-tune models for vertical-specific APIs and patterns, that's a defensible edge. Generic completions are becoming commoditized. Specialized completions are still wide open.

Source: GitHub Blog