Agent-Driven Development: How GitHub's Applied Science Team Ships Code

GitHub's Applied Science team shipped 11 agents and 28,858 lines of code in three days using agent-first development. Here's the workflow that made it possible.

TL;DR

  • GitHub's Applied Science team used Copilot CLI to build eval-agents, a tool for analyzing AI coding benchmarks
  • Five developers shipped 11 new agents, four skills, and 28,858 lines of code in three days using agent-first development
  • Key principles: treat Copilot like a junior engineer, prioritize refactoring and docs, implement CI/CD guardrails
  • If you're building with AI coding tools, this workflow will change how you think about architecture and collaboration

The Big Picture

A GitHub AI researcher just automated their intellectual work. Not the grunt work—the thinking part.

The task was analyzing coding agent performance across benchmarks like TerminalBench2 and SWEBench-Pro. Each benchmark run produces hundreds of trajectory files showing how agents solve problems. Multiply that across dozens of tasks and multiple daily runs, and you're staring at hundreds of thousands of lines of JSON to analyze manually.

The researcher's solution was to build eval-agents, a tool that uses AI agents to analyze AI agent performance. Meta, yes. But the real story isn't the tool itself—it's how they built it.

By treating GitHub Copilot as the primary contributor and structuring the codebase for agent-first development, the team unlocked something remarkable: five developers with no prior project experience shipped 11 new agents, four new skills, and a completely new workflow concept in under three days. That's 28,858 lines of code added across 345 files.

This isn't about Copilot being magic. It's about what happens when you apply good engineering principles—clean architecture, thorough documentation, robust testing—in an environment where an AI agent is doing most of the implementation work. The things we always knew mattered but never had time for suddenly become the most important work you can do.

How It Works

The setup is straightforward: Copilot CLI as the coding agent, Claude Opus 4.6 as the model, VSCode as the IDE. The team also leveraged the Copilot SDK to accelerate agent creation, which gave them access to existing tools, MCP servers, and the ability to register new skills without reinventing infrastructure.

The workflow revolves around three core principles that mirror how you'd work with a junior engineer.

Prompting Like You're Talking to a Human

Agents work best when you're conversational and verbose. Instead of terse commands, the researcher would dump stream-of-consciousness thinking into prompts and use planning mode before jumping into implementation.

Example prompt: "I've recently observed Copilot happily updating tests to fit its new paradigms even though those tests shouldn't be updated. How can I create a reserved test space that Copilot can't touch or must preserve, to protect against regressions?"

This led to a conversation that resulted in guardrails similar to contract testing that only humans can update. The key insight: the things that make human engineers effective—context, clear thinking, planning before coding—make agents effective too.
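One way to implement such a human-only reserved test space (a hypothetical sketch; the post doesn't show the team's actual setup) is a CODEOWNERS rule combined with branch protection, so any change under a protected test directory requires review from a human owner:

```
# .github/CODEOWNERS (hypothetical example)
# Agents may propose code anywhere, but any change under
# tests/contracts/ requires sign-off from a human reviewer.
tests/contracts/ @your-org/human-reviewers
```

With "Require review from Code Owners" enabled on the branch, an agent-authored PR that touches these contract tests can't merge without a human approving the change.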

Architecture and Docs Are Now Your Main Job

Remember all those refactors you wanted to do? The tests you never had time to write? The docs you wish existed when you onboarded? Those are now the highest-leverage work you can do.

When the codebase is clean, well-documented, and well-tested, delivering features with Copilot becomes trivial. The researcher spent most of their time refactoring names and file structures, documenting patterns, and adding test cases. They even cleaned up dead code that agents missed during implementation.

This work makes it easy for Copilot to navigate the codebase and understand patterns, just like it would for any other engineer. You can even ask, "Knowing what I know now, how would I design this differently?" and actually justify rearchitecting the whole project with Copilot's help.

Blame Process, Not Agents

The mindset shift from "trust but verify" to "blame process, not agents" mirrors how effective engineering teams operate. People make mistakes, so we build systems around that reality. Blameless culture provides psychological safety for teams to iterate and innovate.

Applying this to agent-driven development means adding processes and guardrails to prevent mistakes. When mistakes happen, you add more guardrails—better tests, clearer prompts—so the agent can't make the same mistake again.

Strict typing ensures the agent conforms to interfaces. Robust linters impose implementation rules. Integration, end-to-end, and contract tests—expensive to build manually—become much cheaper with agent assistance while giving you confidence that changes don't break existing features.

When Copilot has these tools in its development loop, it can check its own work. You're setting it up for success the same way you'd set up a junior engineer.
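As a minimal sketch of what such guardrails might look like in Python (all names here are hypothetical; the post doesn't include code), a typed interface plus a human-owned contract test gives the agent something concrete to check its own work against:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TrajectoryAnalyzer(Protocol):
    """Interface every analysis agent must satisfy (hypothetical)."""

    def analyze(self, trajectory: dict) -> str: ...


class SummaryAgent:
    """An agent-authored implementation that must conform to the interface."""

    def analyze(self, trajectory: dict) -> str:
        steps = trajectory.get("steps", [])
        return f"{len(steps)} steps"


# Contract test: lives in the human-owned test space. The agent must
# make it pass, never edit it.
def test_analyzer_contract() -> None:
    agent = SummaryAgent()
    assert isinstance(agent, TrajectoryAnalyzer)
    assert isinstance(agent.analyze({"steps": [1, 2, 3]}), str)


test_analyzer_contract()
```

Run under a strict type checker and in CI, a test like this fails loudly the moment an agent drifts from the agreed interface, instead of failing silently downstream.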

What This Changes For Developers

The development loop looks radically different when your codebase is set up for agent-driven development.

First, you plan a new feature with Copilot using /plan. You iterate on the plan, ensure testing is included, and make sure docs updates happen before code implementation. These docs serve as additional guidelines alongside your plan.

Second, you let Copilot implement the feature on /autopilot.

Third, you prompt Copilot to initiate a review loop with the Copilot Code Review agent. Something like: "Request Copilot Code Review, wait for the review to finish, address any relevant comments, and then re-request review. Continue this loop until there are no more relevant comments."

Fourth, human review. This is where you enforce the patterns and principles that keep the codebase agent-friendly.
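Put together, a session might look like this (a sketch using only the commands and prompts named above; your plan wording will differ):

```
/plan Add a skill that summarizes trajectory files. Include tests in the
plan, and update the docs before any code is implemented.

# ...iterate on the plan until testing and docs are covered...

/autopilot

# ...Copilot implements the feature...

Request Copilot Code Review, wait for the review to finish, address any
relevant comments, and then re-request review. Continue this loop until
there are no more relevant comments.

# Finally: human review of the resulting PR.
```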

Outside the feature loop, you run maintenance prompts early and often. Review code for missing tests, broken tests, and dead code. Look for duplication or opportunities for abstraction. Check documentation for gaps and update copilot-instructions.md to reflect changes.

The researcher runs these automatically once a week but often triggers them throughout the week as new features land. This maintains the agent-driven development environment.
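A maintenance prompt along these lines might read as follows (hypothetical wording, not quoted from the team, but covering the checks described above):

```
Review the codebase for missing tests, broken tests, and dead code.
Flag duplication and opportunities for abstraction. Check the docs for
gaps, and update copilot-instructions.md to reflect recent changes.
Report findings as a plan before making any edits.
```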

The result is a workflow where the things that traditionally slow teams down—refactoring, documentation, test coverage—become the work that accelerates everything else. Five developers with no prior project experience shipped 11 agents and four skills in three days because the codebase was structured for agent collaboration.

This approach also changes how you think about collaboration. When GitHub Copilot CLI's /fleet command runs multiple agents in parallel, having a well-structured, well-documented codebase becomes even more critical. Multiple agents working simultaneously need clear boundaries and patterns to avoid stepping on each other.

Try It Yourself

The fastest way to experience this workflow is to start with an existing repo and see how Copilot CLI can help you make it more agent-friendly.

# Install Copilot CLI
# Visit https://github.com/features/copilot/cli for installation

# Navigate to your repo
cd your-repo-path

# Activate Copilot CLI
copilot

# Ask Copilot to analyze your repo for agent-first improvements
/plan Read https://github.blog/ai-and-ml/github-copilot/agent-driven-development-in-copilot-applied-science/ and help me plan how I could best improve this repo for agent-first development

Start small. Pick one area of your codebase that needs refactoring or better documentation. Use /plan to work through the changes with Copilot, then let it implement on /autopilot. Run the code review loop. See how it feels.

The key is treating Copilot like a team member who needs onboarding, clear context, and guardrails. If something goes wrong, ask yourself what process or documentation would have prevented it, then add that to the codebase.

The Bottom Line

Use this approach if you're building tools with AI coding agents, maintaining open source projects, or working on teams where documentation and testing always get deprioritized. The workflow forces you to do the architectural work that makes codebases maintainable, and agents make that work pay off immediately.

Skip this if you're working solo on throwaway prototypes or in codebases where you're the only contributor. The overhead of maintaining agent-friendly documentation and tests won't pay off without collaboration.

The real opportunity here isn't just faster feature development. It's that the skills that make you a great engineer—clear communication, thoughtful design, robust testing—are the same skills that make you effective at building with AI agents. The technology is new. The principles aren't. And when you apply those principles consistently, you might just automate yourself into the most interesting work of your career.

Source: GitHub Blog