How GitHub Tests MCP Server Quality: Offline Evaluation Deep Dive

TL;DR

  • GitHub built an automated evaluation pipeline to test how well LLMs select the right MCP tools and supply correct arguments
  • They treat tool selection as a multi-class classification problem, measuring accuracy, precision, recall, and F1-score
  • Four argument-quality metrics catch hallucinations, missing parameters, and value mismatches before users see them
  • This matters if you're building agents or MCP servers — small prompt changes can tank performance, and you need metrics to prove improvements

The Big Picture

Model Context Protocol (MCP) is the universal adapter that lets LLMs talk to APIs. GitHub's MCP Server powers GitHub Copilot workflows across the platform. But here's the problem: when you tweak a tool description, rename a parameter, or merge two similar tools, you're gambling. Will the model pick the right tool? Will it send the arguments in the correct format? Will it skip a step entirely?

GitHub's engineering team needed a way to ship MCP changes without breaking existing workflows. They built an offline evaluation pipeline that catches regressions before users see them. This isn't about vibes or manual testing. It's about treating tool selection as a classification problem and measuring argument quality with surgical precision.

The stakes are high. A vague tool description means the agent calls the wrong function. A missing parameter definition means the model hallucinates argument names. The outcome is a broken workflow and frustrated developers. Offline evaluation turns "this feels better" into measurable proof.

How It Works

The evaluation pipeline has three stages: fulfillment, evaluation, and summarization. Start with curated benchmarks. Each benchmark contains a natural language input, the expected tool call, and the expected arguments.

Example: "How many issues were created in the github/github-mcp-server repository during April 2025?" The expected tool is list_issues with arguments owner: github, repo: github-mcp-server, and since: 2025-04-01T00:00:00Z.
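A benchmark entry like this can be represented as a small record. The field names below are illustrative, not GitHub's actual benchmark schema:

```python
# A single curated benchmark: natural language input, expected tool,
# expected arguments. Field names are assumptions for illustration.
benchmark = {
    "input": "How many issues were created in the github/github-mcp-server "
             "repository during April 2025?",
    "expected_tool": "list_issues",
    "expected_args": {
        "owner": "github",
        "repo": "github-mcp-server",
        "since": "2025-04-01T00:00:00Z",
    },
}
```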

During fulfillment, the pipeline runs each benchmark across multiple models. The MCP host fetches the tool list from the server, passes it to the LLM along with the user request, and records which tools the model invoked and what arguments it supplied. This is raw output — no judgment yet.
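The fulfillment stage can be sketched as a plain loop over models and benchmarks. Here `call_model` is a hypothetical stand-in for the MCP host that sends the tool list plus the user request to an LLM and returns the invoked tool and its arguments:

```python
# Fulfillment sketch: run every benchmark against every model and record
# raw outputs. No correctness judgment happens here; that is the
# evaluation stage's job.
def run_fulfillment(benchmarks, models, call_model):
    records = []
    for model in models:
        for bench in benchmarks:
            tool_name, args = call_model(model, bench["input"])
            records.append({
                "model": model,
                "input": bench["input"],
                "expected_tool": bench["expected_tool"],
                "actual_tool": tool_name,
                "actual_args": args,
            })
    return records
```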

Evaluation processes those outputs. GitHub treats tool selection as a multi-class classification problem. Each tool is a class. Each benchmark is labeled with the tool it expects. The pipeline computes accuracy, precision, recall, and F1-score.

Accuracy is straightforward: percentage of inputs that resulted in the expected tool call. Precision shows the proportion of correct calls out of all times the tool was invoked. Low precision means the model picks the tool even when it shouldn't. Recall shows the proportion of correct calls out of all times the tool was expected. Low recall means the model doesn't understand when to call the tool.

F1-score is the harmonic mean of precision and recall. It's the single number that tells you if the model is actually good at picking this tool.
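Under the one-tool-per-class framing, these per-tool metrics reduce to counting true positives, false positives, and false negatives. A minimal sketch over (expected, actual) pairs:

```python
# Per-tool precision, recall, and F1 from (expected_tool, actual_tool)
# pairs, treating each tool as a class in a multi-class problem.
def tool_metrics(pairs, tool):
    tp = sum(1 for exp, act in pairs if exp == tool and act == tool)
    fp = sum(1 for exp, act in pairs if exp != tool and act == tool)
    fn = sum(1 for exp, act in pairs if exp == tool and act != tool)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```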

Confusion matrices reveal which tools the model mixes up. GitHub had two tools — list_issues and search_issues — that models confused constantly. The confusion matrix showed list_issues being called in 30% of cases where search_issues was expected. That's a precision problem for list_issues and a recall problem for search_issues. The fix: rewrite the tool descriptions to sharpen the distinction.
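To see why one confusion hits the first tool's precision and the second tool's recall, here is a toy computation with made-up counts echoing that 30% figure:

```python
# Toy data: of 10 cases expecting search_issues, the model wrongly calls
# list_issues in 3 (the ~30% confusion); list_issues cases are all correct.
pairs = [("search_issues", "list_issues")] * 3 + \
        [("search_issues", "search_issues")] * 7 + \
        [("list_issues", "list_issues")] * 10

def precision_recall(pairs, tool):
    tp = sum(exp == act == tool for exp, act in pairs)
    predicted = sum(act == tool for _, act in pairs)
    expected = sum(exp == tool for exp, _ in pairs)
    return tp / predicted, tp / expected

# list_issues: precision suffers (10/13), recall is perfect (10/10).
# search_issues: precision is perfect (7/7), recall suffers (7/10).
```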

Argument correctness is the second evaluation target. Selecting the right tool isn't enough if the model sends garbage arguments. GitHub tracks four metrics:

  • Argument hallucination: How often the model invents argument names that don't exist in the tool definition
  • All expected arguments provided: Whether every expected argument is present in the call
  • All required arguments provided: Whether all required arguments are included
  • Exact value match: Whether provided argument values match the expected values exactly

These metrics are computed only for tools that were correctly selected. No point checking arguments if the model picked the wrong tool entirely.
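The four checks can be sketched against a JSON-Schema-style tool definition (MCP tool input schemas use `properties` and `required`); the return keys below just mirror the metric names in the list above:

```python
# Argument-quality checks for a call where the tool was correctly selected.
def check_arguments(actual, expected, schema):
    props = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    return {
        # invented argument names absent from the tool definition
        "hallucinated": sorted(set(actual) - props),
        "all_expected_provided": set(expected) <= set(actual),
        "all_required_provided": required <= set(actual),
        "exact_value_match": all(actual.get(k) == v
                                 for k, v in expected.items()),
    }
```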

Summarization aggregates dataset-level statistics and produces the final report. This is what engineers review before merging a change. Did accuracy go up? Did precision drop for any tool? Did argument hallucination increase? The report answers these questions with numbers, not guesses.
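The summarization step amounts to rolling per-benchmark results up into headline numbers. A minimal sketch, showing accuracy and hallucination rate (a real report would also carry per-tool precision, recall, and F1):

```python
# Aggregate per-benchmark evaluation records into report-level statistics.
def summarize(records):
    total = len(records)
    correct = sum(r["expected_tool"] == r["actual_tool"] for r in records)
    hallucinated = sum(bool(r.get("hallucinated")) for r in records)
    return {
        "accuracy": correct / total,
        "hallucination_rate": hallucinated / total,
    }
```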

What This Changes For Developers

If you're building MCP servers or agentic workflows, this evaluation approach is a blueprint. You can't ship tool changes based on vibes. You need metrics that prove improvements and catch regressions.

The classification framing is the key insight. Tool selection isn't a fuzzy problem. It's multi-class classification. You have classes (tools), labels (expected tool calls), and predictions (actual tool calls). Standard ML metrics apply. Accuracy, precision, recall, F1-score. Confusion matrices. These aren't exotic techniques. They're undergrad stats. But applying them to MCP evaluation is what makes this work.

Argument quality metrics are equally critical. Hallucination is the silent killer. The model picks the right tool, but invents an argument name that doesn't exist. The call fails. The user sees an error. You lose trust. Tracking hallucination rate per tool tells you which descriptions are confusing the model.

The feedback loop is fast. You change a tool description, run the evaluation pipeline, and get a report in minutes. No need to deploy to production and wait for bug reports. No need to manually test every tool combination. The pipeline does it for you.

This matters for GitHub Copilot because the MCP Server is the foundation. Every Copilot workflow that touches GitHub APIs goes through MCP. A regression in tool selection or argument handling breaks workflows for millions of developers. Offline evaluation is the safety net.

Try It Yourself

GitHub hasn't open-sourced the evaluation pipeline itself, but you can build a similar system using standard ML tooling. The core components are:

  • A benchmark dataset with inputs, expected tools, and expected arguments
  • A fulfillment script that runs benchmarks across multiple models and records outputs
  • An evaluation script that computes classification metrics and argument quality metrics
  • A summarization script that aggregates results into a report

For classification metrics, use scikit-learn's classification_report and confusion_matrix. For argument quality, write custom logic that compares expected vs. actual arguments. Track hallucinations by checking if argument names exist in the tool schema. Track completeness by checking if all expected arguments are present. Track exact match by comparing values.
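Wiring the classification half together with scikit-learn looks roughly like this (the tool-name lists are illustrative stand-ins for what fulfillment would record):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Expected vs. actual tool names collected from the fulfillment stage
# (values here are made up for illustration).
expected = ["list_issues", "search_issues", "search_issues", "list_issues"]
actual = ["list_issues", "list_issues", "search_issues", "list_issues"]

labels = ["list_issues", "search_issues"]
# Rows are expected tools, columns are actual tools: off-diagonal cells
# show which tools the model mixes up.
print(confusion_matrix(expected, actual, labels=labels))
# Per-tool precision, recall, F1, and support in one report.
print(classification_report(expected, actual, labels=labels, zero_division=0))
```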

The GitHub MCP Server is open source. You can inspect the tool definitions and see how they're structured. Study the descriptions. Notice how they're concise and specific. That's intentional. Vague descriptions confuse models. Tight descriptions improve precision and recall.

The Bottom Line

Use this approach if you're building MCP servers or agentic systems where tool selection matters. The classification framing and argument quality metrics are immediately applicable. Skip it if you're doing one-off prototypes or demos where regressions don't matter. The real opportunity here is turning MCP quality into a measurable, improvable property instead of a black box. The risk is shipping changes without evaluation and discovering regressions in production when users complain.

Source: GitHub Blog