GitHub Copilot's Tool Selection Fix: Faster Responses, Smarter Routing

GitHub cut Copilot's default toolset from 40 to 13 tools and built embedding-guided routing. The result: 400ms faster responses and 94.5% tool coverage. Here's how they did it—and why more tools often means worse performance.
TL;DR

  • GitHub cut Copilot's default toolset from 40 to 13 core tools, reducing response latency by 400ms on average
  • New embedding-guided routing achieves 94.5% tool coverage vs 69% with the old static approach
  • Adaptive tool clustering groups MCP tools dynamically, preventing model context overflow
  • Matters for anyone using Copilot Chat in VS Code—especially if you've seen that "Optimizing tool selection..." spinner

The Big Picture

If you've used GitHub Copilot Chat in VS Code and watched it spin on "Optimizing tool selection..." for several seconds, you've experienced the paradox of choice at the model level. GitHub's agent can access hundreds of tools through the Model Context Protocol—everything from codebase analyzers to Azure utilities. More tools should mean more capability. Instead, it often means slower responses and worse decisions.

The problem isn't unique to Copilot. As agentic systems proliferate, developers are discovering that throwing every possible tool at an LLM doesn't make it smarter. It makes it confused. The model burns tokens evaluating irrelevant options, misses cache hits, and sometimes just picks the wrong tool because the search space is too large. GitHub's solution is counterintuitive: give the model fewer choices upfront, but make those choices smarter.

The engineering team built two new systems—embedding-guided tool routing and adaptive tool clustering—and trimmed the default toolset to 13 core tools. The result: 2-5 percentage point improvements on the SWE-Lancer and SWE-bench Verified benchmarks, and 400ms faster average response times in production A/B tests. This isn't just a performance tweak. It's a blueprint for how to build agents that scale without drowning in their own capabilities.

How It Works

The core insight is that tool selection is a semantic search problem, not a reasoning problem. When you ask Copilot to "fix this bug and merge it into the dev branch," the model doesn't need to evaluate all 40+ tools. It needs to find the merge tool inside the GitHub MCP group. But with the old approach, the model would explore search tools, then documentation tools, then local Git tools—each lookup adding latency and a chance of failure—before finally landing on the right one.

GitHub's fix has three parts. First, they use their internal Copilot embedding model to generate vector representations of every tool. Similar tools cluster together in embedding space based on cosine similarity. This is deterministic and fast—no LLM calls required for the clustering itself. They still use a model to summarize each cluster, but that's a single cheap call per group, not a sprawling categorization task that sometimes forgets tools mid-stream.
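To make the clustering step concrete, here is a minimal sketch of grouping tools by cosine similarity over their embeddings. The tool names and vectors are toy stand-ins—a real system would embed each tool's name and description with an embedding model—but the greedy threshold clustering illustrates why this step is deterministic and cheap:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_tools(tool_embeddings, threshold=0.8):
    """Greedy clustering: each tool joins the first cluster whose
    seed vector it resembles closely enough, else starts a new cluster."""
    clusters = []  # list of (seed_vector, [tool names])
    for name, vec in tool_embeddings.items():
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]

# Toy 3-d embeddings standing in for the real embedding model's output.
tools = {
    "git_merge":  [0.90, 0.10, 0.00],
    "git_branch": [0.85, 0.15, 0.00],
    "read_file":  [0.00, 0.90, 0.10],
    "edit_file":  [0.05, 0.88, 0.10],
}
print(cluster_tools(tools))  # Git tools cluster together; file tools cluster together
```

No LLM call is needed for this grouping; a model only comes in afterward, to write a one-line summary per cluster.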

Second, they introduced "virtual tools"—functional groupings that act like directories. The model sees high-level categories (Jupyter Notebook Tools, Web Interaction Tools, etc.) instead of dozens of individual tool names. If it needs something specific, it expands the relevant group. This reduces the initial context window and improves cache hit rates, since related tools tend to be used together.
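A directory-style grouping like this can be sketched as a small data structure. The class and group names below are hypothetical, not GitHub's actual implementation; the point is that an unexpanded group costs one context entry instead of one per tool:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualTool:
    """A directory-style grouping: the model first sees only the group
    name and summary, and expands it when it needs a concrete tool."""
    name: str
    summary: str
    tools: list = field(default_factory=list)
    expanded: bool = False

def initial_toolset(core_tools, virtual_groups):
    """What the model sees at the start of a turn: every core tool,
    plus one compact entry per unexpanded group."""
    view = list(core_tools)
    for group in virtual_groups:
        if group.expanded:
            view.extend(group.tools)
        else:
            view.append(f"{group.name}: {group.summary}")
    return view

jupyter = VirtualTool("jupyter_tools", "Create and run notebooks",
                      tools=["create_notebook", "run_cell"])
view = initial_toolset(["read_file", "edit_file"], [jupyter])
print(view)  # two core tools plus one compact group entry
```

Because related tools live in the same group and expand together, similar queries see similar tool listings, which is where the cache-hit improvement comes from.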

Third, and most important, is embedding-guided routing. Before the model even starts reasoning, the system compares the query embedding against all tool embeddings and pre-selects the most semantically relevant candidates. If your query mentions "merge" and "branch," the GitHub merge tool gets surfaced immediately—no exploratory calls needed. This is where the big performance gains come from.
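The routing step itself reduces to a nearest-neighbor lookup. This sketch uses toy 2-d vectors and hypothetical tool names; in practice the query text and tool descriptions would both go through the same embedding model:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(query_vec, tool_embeddings, k=3):
    """Rank every tool by similarity to the query embedding and
    pre-select the top k before the model starts reasoning."""
    ranked = sorted(tool_embeddings.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

tool_vecs = {
    "github_merge_pr": [0.9, 0.1],
    "read_file":       [0.1, 0.9],
    "run_terminal":    [0.5, 0.5],
}
# Toy query vector for "merge this branch into dev".
print(route([0.95, 0.05], tool_vecs, k=2))
```

The pre-selected candidates land in the model's context up front, so the merge tool is visible on the first reasoning step rather than after several exploratory calls.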

The team measures success with "Tool Use Coverage"—how often the model already has the right tool visible when it needs it. The embedding approach hit 94.5% coverage in benchmarks, compared to 87.5% for LLM-based selection and 69% for the old static list. That's a 25.5 percentage point absolute improvement over the static baseline. In production, 72% of tool calls in the Insiders build were successfully pre-expanded, versus just 19% in the old Stable build.
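The coverage metric itself is simple to compute. A minimal sketch, assuming a log of (tool needed, tools visible at that moment) pairs:

```python
def tool_use_coverage(calls):
    """Fraction of tool calls where the needed tool was already
    visible (pre-selected or expanded) at call time."""
    hits = sum(1 for needed, visible in calls if needed in visible)
    return hits / len(calls)

# Hypothetical call log: one covered call, one that needed expansion.
calls = [
    ("github_merge_pr", {"github_merge_pr", "read_file"}),  # covered
    ("run_cell",        {"read_file", "edit_file"}),        # miss
]
print(tool_use_coverage(calls))  # 0.5
```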

The toolset reduction is equally important. GitHub analyzed usage stats and performance data to identify 13 essential tools: repository structure parsing, file reading and editing, context search, terminal access. Everything else got grouped into virtual categories. This wasn't arbitrary pruning—they observed a 2-5 point drop in resolution rates on SWE-Lancer when the agent had access to the full 40-tool set. Too many options made the model ignore instructions, use tools incorrectly, or call unnecessary ones.

With the smaller core set, Time To First Token dropped by 190ms on average, and total response time (Time To Final Token) dropped by 400ms. The model reasons faster because it has less to reason about. Simple, but not obvious until you measure it.

What This Changes For Developers

If you're building on MCP or any multi-tool agent system, this is a warning: more tools will hurt you before they help you. The instinct is to expose everything—let the model figure it out. But models don't "figure it out" efficiently at scale. They thrash.

For Copilot users, the changes are already rolling out. You'll notice faster responses, especially on complex queries that used to trigger that spinner. The agent will also make fewer mistakes—less tool misuse, fewer ignored instructions. GitHub's data shows this isn't just a latency win; it's a correctness win.

The broader lesson is about agent architecture. Tool selection isn't a reasoning task you should offload entirely to the LLM. It's a retrieval task you can optimize with embeddings and semantic search. GitHub's approach—cluster tools, route with embeddings, surface only the top candidates—is generalizable. If you're building agents that call APIs, query databases, or interact with external systems, you can apply the same pattern.

There's also a cache efficiency angle. By grouping related tools and pre-expanding likely candidates, GitHub reduces cache misses. The model sees the same tool clusters across similar queries, which means better KV cache reuse and lower inference costs. This matters more as context windows grow and agents handle longer sessions.

One caveat: this approach assumes your tools have good semantic descriptions. If your tool names and docstrings are vague or inconsistent, embeddings won't save you. GitHub's success here depends on clear, descriptive tool metadata. That's table stakes for any agent system, but it's worth emphasizing.

Try It Yourself

The changes are live in GitHub Copilot for VS Code. If you're on the Insiders build, you're already seeing the new routing. To test the difference, try a query that requires a specific tool buried in a large MCP server:

"Create a pull request from this branch and request review from the team"

Watch how quickly Copilot surfaces the GitHub PR tools. Compare that to a few weeks ago, when it might have explored file tools or search tools first. The latency difference is noticeable, especially on slower connections or larger codebases.

If you're building your own MCP server or agent system, the takeaway is to instrument tool usage. Track which tools get called, which queries trigger which tools, and where the model wastes time exploring dead ends. GitHub's approach started with usage data—they couldn't have trimmed the toolset without knowing which tools actually mattered.
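A starting point for that instrumentation can be as small as a counter with latency accounting. The class below is an illustrative sketch, not a library API; wrap your agent's tool-dispatch path with something like it:

```python
import time
from collections import Counter

class ToolUsageTracker:
    """Minimal instrumentation: record each tool call and its latency
    so you can later see which tools actually matter."""
    def __init__(self):
        self.calls = Counter()
        self.latency_ms = Counter()

    def record(self, tool_name, started_at):
        """Log one completed call; started_at is a time.monotonic() stamp."""
        self.calls[tool_name] += 1
        self.latency_ms[tool_name] += int((time.monotonic() - started_at) * 1000)

    def top_tools(self, n=5):
        """Most frequently called tools -- candidates for the core set."""
        return self.calls.most_common(n)

tracker = ToolUsageTracker()
for tool in ["read_file", "read_file", "git_merge"]:
    t0 = time.monotonic()
    tracker.record(tool, t0)
print(tracker.top_tools())  # read_file called twice, git_merge once
```

Even this much data tells you which tools belong in the default set and which can safely move behind a virtual group.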

The Bottom Line

Use this if you're building multi-tool agents or working with MCP servers in production. The embedding-guided routing pattern is proven and generalizable. Skip the temptation to expose every possible tool upfront—it will slow you down and hurt accuracy. The real opportunity here is rethinking tool selection as a retrieval problem, not a reasoning problem. GitHub's data shows that semantic search beats LLM-based selection by 7 percentage points in coverage, and it's orders of magnitude faster.

The risk is assuming your LLM will "just figure it out" when you hand it 100+ tools. It won't. It'll thrash, miss cache, and pick the wrong tool often enough to degrade user experience. If you're seeing latency spikes or weird tool misuse in your agent, this is probably why. Trim the default set, cluster the rest, and route with embeddings. GitHub's results speak for themselves: 400ms faster, 2-5 points better on benchmarks, and 72% pre-expansion success in production.

For more on how GitHub approaches agent safety and architecture, see their agentic security model.

Source: GitHub Blog