How GitHub Trained Copilot to Predict Your Next Edit
GitHub trained a custom model to predict your next code edit in real time. Here's why pull request data failed, how reinforcement learning fixed it, and what three major releases have changed since February.
TL;DR
- GitHub built a custom model for next edit suggestions (NES) because frontier models were either too slow or too inaccurate
- Training on pull request diffs failed — they needed real-time editing session data from internal volunteers
- Reinforcement learning with LLM-based graders improved the model by teaching it what NOT to suggest
- Three major releases since February have made the model faster and more selective; the latest cut the shown rate 24.5%, boosted acceptance 26.5%, and reduced annoying suggestions (the hide rate) by 25.6%
The Big Picture
Predicting the next token is easy. Predicting the next edit is hard.
GitHub's next edit suggestions feature launched in February with a problem most AI coding tools haven't solved: anticipating what you'll change next without you asking. Not autocomplete. Not chat. A model that watches you code and jumps in with the exact refactor, fix, or cleanup you were about to make.
The catch? Frontier models couldn't do it. GPT-4 was too slow. Smaller models were fast but useless. So GitHub's team did what most vendors avoid: they trained a custom model, built a dataset that didn't exist, and co-designed the UX and training pipeline together.
This is the story of how they built it, why their first attempt failed, and what three major model updates have changed since launch. It's also a case study in why "AI-native" product development — where model training, prompting, and UX evolve together — produces better results than bolting a general-purpose LLM onto an existing feature.
Why Pull Request Data Doesn't Work
GitHub's first instinct was logical: train on internal pull request diffs. PRs contain edits. Edits are what the model needs to predict. Ship it.
Internal testing killed that idea fast. The model was timid. It refused to touch unfinished code. It hesitated to suggest changes to the line you were actively typing. It defaulted to doing nothing. In practice, it performed worse than a vanilla LLM with no fine-tuning.
The problem wasn't the model architecture. It was the data. Pull requests show the final state of code after review, not the messy, iterative process of writing it. They lack temporal ordering, so the model can't learn when changes happen. They contain almost no negative samples — cases where the correct action is "don't suggest anything." And they miss abandoned edits, in-progress rewrites, and all the other chaotic behavior that defines real coding sessions.
So GitHub reset. They built a custom dataset by recording real editing sessions from internal volunteers. Not diffs. Not commits. Actual keystroke-level data showing how developers write, rewrite, and abandon code in the editor.
That dataset became the foundation for every NES model since. Supervised fine-tuning on this data produced the first model to outperform vanilla LLMs. Data quality mattered more than volume — a smaller set of high-quality edit sessions beat a larger set of noisy PR diffs.
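To make the contrast concrete, here is a minimal sketch of what a keystroke-level session record might look like. The field names and schema are hypothetical, not GitHub's actual format; the point is what PR diffs lack: timestamps for temporal ordering, in-progress buffers, and explicit negative samples.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EditEvent:
    """One step in a recorded editing session (hypothetical schema)."""
    timestamp_ms: int      # temporal ordering that PR diffs lack
    file_path: str
    buffer_before: str     # messy, unfinished code as actually typed
    cursor_offset: int
    edit: Optional[str]    # None = negative sample: "suggest nothing"

session = [
    EditEvent(0, "app.py", "def total(items):", 17,
              "\n    return sum(items)"),
    EditEvent(850, "app.py", "def total(items):\n    retu", 26,
              None),  # mid-keystroke: the right action is to stay quiet
]

# Negative samples are what teach the model when NOT to intervene --
# PR diffs contain almost none of these.
negatives = [e for e in session if e.edit is None]
print(len(negatives))  # 1
```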
Reinforcement Learning: Teaching the Model What Not to Do
Supervised fine-tuning taught the model what a good edit looks like. But it couldn't teach the model what makes an edit bad. And it couldn't leverage the massive amount of unlabeled code that GitHub had access to.
Enter reinforcement learning. GitHub's team built an LLM-based grader that scores edit suggestions on correctness and UX. The grader doesn't just check if the edit is technically valid — it evaluates whether the diff is easy to read, whether the suggestion is distracting, and whether the model is being too eager or too passive.
The grading criteria evolve. The team routinely analyzes model outputs, identifies new patterns that indicate unhelpful edits, and updates the grader. This creates a feedback loop: the model generates suggestions, the grader scores them, and the model learns to avoid low-scoring patterns.
RL also expanded training to unsupervised data, which increased volume and diversity without requiring labeled ground truth. This forced the model to generalize better and prevented it from collapsing into simple, safe suggestions that work in common cases but fail on edge cases.
The result? The model learned to be more selective. It stopped suggesting edits just because it could. It started predicting when not to intervene.
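The grader's scoring can be imagined as a weighted blend of correctness and UX signals that becomes the RL reward. The sketch below is purely illustrative; GitHub has not published its criteria, weights, or grader prompts.

```python
# Illustrative reward function for an LLM-based grader. Each
# criterion is assumed to be scored in [0, 1] by the grader model;
# the weights here are invented for demonstration.
def grade_suggestion(correctness: float, readability: float,
                     distraction: float, eagerness: float) -> float:
    """Blend correctness and UX signals into one scalar RL reward."""
    weights = {"correctness": 0.5, "readability": 0.2,
               "calm": 0.2, "restraint": 0.1}
    return (weights["correctness"] * correctness
            + weights["readability"] * readability
            + weights["calm"] * (1.0 - distraction)       # not distracting
            + weights["restraint"] * (1.0 - eagerness))   # not too eager

# A technically valid but noisy, premature edit scores lower than a
# clean, well-timed one -- the model learns what NOT to suggest.
good = grade_suggestion(1.0, 0.9, 0.1, 0.0)   # 0.96
noisy = grade_suggestion(1.0, 0.4, 0.9, 0.8)  # 0.62
assert good > noisy
```

The key design point from the article survives even in this toy version: correctness alone is not enough, because a correct edit that interrupts the developer still earns a low reward.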
Three Releases, Three Different Trade-Offs
Since February, GitHub has shipped three major NES updates. Each one balanced speed, precision, and developer tolerance for interruptions differently.
April: The first major update restructured the response format to require fewer tokens, which cut latency and improved suggestion quality. Acceptance rate jumped 10%. Hide rate (when developers manually dismiss suggestions) dropped 17.5%.
May: Developers complained the model was too eager. It suggested edits before they wanted them. So GitHub tuned the model to be more conservative. Shown rate dropped 18.8%, but acceptance rate spiked 23.2%. Fewer suggestions, but the ones that appeared were more useful.
November: After testing nearly thirty candidate models over the summer — none of which beat the May release — GitHub finally shipped an update that cleared the A/B testing bar. Shown rate dropped another 24.5%. Acceptance rate climbed 26.5%. Hide rate fell 25.6%. The model got faster, smarter, and less annoying all at once.
The November release achieved this through prompt optimization (shorter prompts, more token caching), data quality filtering (LLM-based graders removed ambiguous samples), synthetic data distillation (training a smaller model on outputs from a larger one), and hyperparameter tuning for the new base architecture.
Every model candidate goes through three evaluation stages: offline testing on targeted scenarios, internal dogfooding by GitHub and Microsoft engineers, and A/B experiments on a small percentage of real-world requests. Only models that beat production on acceptance, hide, and latency metrics get shipped.
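Assuming the usual definitions (acceptance and hide rates as fractions of shown suggestions), the gating metrics can be computed from a per-request event log roughly like this. The event names are invented for illustration; the article does not give exact formulas.

```python
from collections import Counter

def nes_metrics(events: list[str]) -> dict[str, float]:
    """Compute the three A/B gating metrics from suggestion events.

    Assumed definitions:
      shown rate      = shown suggestions / total requests
      acceptance rate = accepted / shown
      hide rate       = manually dismissed / shown
    """
    c = Counter(events)
    requests = len(events)
    shown = c["accepted"] + c["hidden"] + c["ignored"]
    return {
        "shown_rate": shown / requests,
        "acceptance_rate": c["accepted"] / shown,
        "hide_rate": c["hidden"] / shown,
    }

# "none" = the model chose not to surface a suggestion at all
log = ["accepted", "hidden", "ignored", "none", "accepted"]
m = nes_metrics(log)
print(m["shown_rate"], m["acceptance_rate"])  # 0.8 0.5
```

Note how the May trade-off falls out of these definitions: suppressing marginal suggestions lowers the shown rate but raises the acceptance rate, because the surviving suggestions are better on average.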
What Developers Actually Want
Developer feedback has driven almost every change to NES. Some developers want the model to be more aggressive — jump in immediately, suggest continuously. Others want it to be more restrained — only intervene when it's obvious what comes next.
There's no universal preference. Like tabs versus spaces, "helpful" looks different depending on the developer.
So far, GitHub has focused on a default experience that works for most people. But that balance has shifted over time based on real usage patterns. Early releases were too eager. Recent releases are more selective. The next frontier is adaptive behavior — a model that learns your editing style and adjusts its aggressiveness based on whether you accept, dismiss, or ignore suggestions.
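As a thought experiment, that adaptive loop could be as simple as a per-user confidence threshold nudged by each reaction. This is speculation about how such a feature might work, not GitHub's design.

```python
def adjusted_threshold(history: list[str], base: float = 0.5,
                       step: float = 0.05,
                       lo: float = 0.2, hi: float = 0.9) -> float:
    """Speculative sketch: move the confidence bar for showing a
    suggestion based on recent reactions. Dismissals push the
    threshold up (be quieter); accepts pull it down (be bolder);
    ignores drift it slightly up."""
    nudge = {"accepted": -step, "hidden": +step, "ignored": +step / 2}
    t = base + sum(nudge[r] for r in history)
    return max(lo, min(hi, t))  # clamp so one streak can't run away

# A user who keeps dismissing suggestions gets a quieter model;
# a user who keeps accepting them gets a bolder one.
quiet = adjusted_threshold(["hidden"] * 4)    # 0.7
bold = adjusted_threshold(["accepted"] * 4)   # 0.3
assert quiet > bold
```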
That work is ongoing. GitHub's team is also exploring edits at a distance (suggestions across multiple files, not just where you're typing), faster responses (continued latency improvements), and smarter edits (better anticipation of context and cross-file dependencies).
If you have feedback on NES, GitHub wants to hear it. File an issue in the VS Code repository or submit feedback directly through the editor.
Try It Yourself
Next edit suggestions are available now in VS Code with the Copilot Chat extension. Make sure you're running the latest version of both, then enable NES in your VS Code settings.
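At the time of writing, the relevant setting is named `github.copilot.nextEditSuggestions.enabled`; verify the name against the current Copilot documentation, since settings occasionally change between releases.

```jsonc
// settings.json (VS Code user or workspace settings)
{
  "github.copilot.nextEditSuggestions.enabled": true
}
```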
The feature runs in the background as you edit. You don't invoke it. It watches your code, predicts what you'll change next, and surfaces suggestions inline. Accept with Tab. Reject with Escape. Or ignore it and keep typing.
If you're already using GitHub Copilot agents, NES integrates seamlessly. It's not a separate tool. It's part of the same system, trained to anticipate edits while agents handle higher-level tasks.
The Bottom Line
Use this if you're already in the GitHub Copilot ecosystem and you edit code in VS Code. The model is fast, unobtrusive, and genuinely useful for repetitive refactors and cleanup tasks. Skip it if you're not a VS Code user or if you prefer explicit control over every suggestion — NES is designed to intervene automatically, and some developers find that distracting. The real opportunity here is the training approach: custom datasets, RL-based grading, and co-designed UX. That's the playbook for building AI features that don't feel bolted on. The risk is that adaptive behavior could make the model feel unpredictable if GitHub doesn't nail the personalization layer.
Source: GitHub Blog