GitHub Copilot Will Train on Your Code Unless You Opt Out

Starting April 24, GitHub will train AI models on Copilot Free, Pro, and Pro+ interaction data by default. Business and Enterprise users are unaffected. Here's what's collected, what's excluded, and whether you should opt out.

TL;DR

  • Starting April 24, GitHub will train AI models on interaction data from Copilot Free, Pro, and Pro+ users by default
  • Business and Enterprise users are unaffected — their data stays private
  • You can opt out in settings, and previous opt-out preferences are preserved
  • Training data includes accepted code, inputs, context, file structure, and your feedback on suggestions

The Big Picture

GitHub just changed the default privacy posture for millions of developers. If you're on Copilot Free, Pro, or Pro+, your code interactions will train GitHub's AI models starting April 24 unless you explicitly opt out. This isn't a bug or a leak — it's policy.

The move mirrors what we've seen across the AI industry: OpenAI trains on ChatGPT conversations unless you disable it, Anthropic uses Claude interactions for safety research, and now GitHub is formalizing what many suspected was already happening. The difference here is scope. Copilot sees your cursor position, file structure, navigation patterns, and every suggestion you accept or reject. That's far more granular than a chat log.

GitHub frames this as necessary for model improvement, citing "meaningful improvements" from training on Microsoft employee data over the past year. They claim acceptance rates increased across multiple languages. The subtext: your code makes their models better, and better models theoretically help everyone. The question is whether you're comfortable being part of that feedback loop.

What Data Gets Collected

GitHub is specific about what they're taking. If you don't opt out, they collect:

  • Accepted and modified outputs — every suggestion you keep or tweak tells the model what works
  • Inputs and code snippets — what you send to Copilot, including the code shown to the model for context
  • Cursor context — the code surrounding your active position, which reveals patterns and intent
  • Comments and documentation — natural language that helps models understand developer reasoning
  • File names and repo structure — project organization and navigation habits
  • Feature interactions — whether you use chat, inline suggestions, or other Copilot modes
  • Feedback signals — thumbs up/down ratings on suggestions

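Taken together, each of those bullets maps onto a field in a structured interaction record. Here's a hypothetical sketch of what one accepted-suggestion event might carry; the field names are purely illustrative, since GitHub's actual telemetry schema isn't public:

```python
# Hypothetical sketch of one Copilot interaction event, built from the
# categories GitHub lists. Field names are illustrative only -- this is
# NOT GitHub's real telemetry schema.
interaction_event = {
    "event_type": "suggestion_accepted",        # feedback signal: kept vs. rejected
    "surface": "inline",                        # feature interaction: chat, inline, etc.
    "input_snippet": "def load_config(path):",  # input sent to the model
    "cursor_context": {                         # code surrounding the active position
        "before": "import json\n\n",
        "after": "",
        "line": 3,
    },
    "file_name": "src/config.py",               # file names and repo structure
    "repo_layout": ["src/", "tests/", "README.md"],
    "output": "    with open(path) as f:\n        return json.load(f)",
    "user_edit": None,                          # set if the suggestion was modified
    "rating": None,                             # thumbs up/down, if given
}
```

Even in this rough form, the point is visible: a single event can bundle your prompt, the surrounding code, your project layout, and your judgment on the output, which is why interaction data is so much richer than a chat transcript.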
What they explicitly exclude: interaction data from Business and Enterprise tiers, data from users who opt out, and content from issues or discussions. They also clarify that private repo code "at rest" isn't used, but code from private repos is processed when you actively use Copilot — and that interaction data is fair game for training unless you opt out.

The data may be shared with Microsoft and other GitHub affiliates, but not with third-party AI providers. That's a meaningful boundary, though it still means your code could inform models across Microsoft's ecosystem.

What This Changes For Developers

The practical impact depends on your tier and threat model. If you're on Copilot Business or Enterprise, nothing changes — your data was already excluded from training. If you're on Free, Pro, or Pro+, you now have a choice to make.

For individual developers working on open-source or side projects, the risk is probably low. Your code patterns and workflow habits will train models that might help you later. The trade-off is philosophical: are you comfortable contributing free training data to a product that, on Pro or Pro+, you're already paying for?

For developers working on proprietary codebases or in regulated industries, this is a harder call. Even if you're not on a Business plan, you might be writing code that reveals competitive techniques, domain-specific algorithms, or compliance-sensitive patterns. GitHub says they won't use "at rest" private repo content, but the moment you invoke Copilot, that code becomes interaction data. If you're in fintech, healthcare, or defense, your compliance team will want to know about this.

The opt-out is straightforward: a toggle in your Copilot settings under "Privacy." If you previously opted out of data collection for product improvements, GitHub says your preference is preserved. That's good, but it also means new users are opted in by default, which shifts the burden to developers to actively protect their data.

This also raises questions about security tooling integration. If GitHub is training on interaction data that includes potential bugs and vulnerabilities, does that make their models better at catching issues, or does it risk encoding bad patterns? GitHub claims training on real-world data improves their ability to "help you catch potential bugs before they reach production," but the mechanism isn't clear.

The Industry Context

GitHub isn't alone here. Every major AI coding tool faces the same tension: models improve with more data, but developers don't want their code used without consent. Cursor, Cody, and other tools have similar policies, though the defaults vary.

What's notable is the timing. GitHub rolled this out after a year of training on Microsoft employee data, which suggests they've already validated the approach internally. The expansion to Free, Pro, and Pro+ users is a scale play — they need more diverse data to compete with models trained on broader codebases.

The other context is agent-driven development. As Copilot evolves from autocomplete to multi-step agents, the models need richer interaction data to understand workflows, not just syntax. That's why they're collecting navigation patterns and feature interactions, not just code snippets.

The Bottom Line

Opt out if you work on proprietary code, operate in a regulated industry, or simply don't want to contribute training data to a product you're paying for. The toggle is easy to find, and GitHub has preserved previous opt-out preferences.

Stay opted in if you're working on open-source projects, want to contribute to model improvement, and trust GitHub's data handling. The models will likely get better, and you'll benefit indirectly.

The real risk isn't data leakage — GitHub's boundaries around third-party sharing are clear. The risk is normalization. If every AI tool defaults to training on user data, developers lose agency over their work. This policy is legal, disclosed, and reversible, but it sets a precedent. If you care about that precedent, opt out now and make it clear that default consent isn't acceptable.

Source: GitHub Blog