Writing Effective Tools for AI Agents—Using AI Agents
Anthropic's internal playbook for building MCP tools that agents actually use well. Their secret? Let Claude Code optimize your tools against real evaluations—it beats human-written implementations.
TL;DR
- Tools for AI agents require fundamentally different design than traditional APIs—agents are non-deterministic and context-limited
- Build prototypes fast, run comprehensive evaluations, then let Claude Code optimize your tools automatically against real-world tasks
- Anthropic's internal testing shows Claude-optimized tools outperform human-written ones on held-out test sets
- Key principles: selective tool implementation, meaningful context over flexibility, token efficiency, and prompt-engineered descriptions
The Big Picture
The Model Context Protocol promises to give AI agents access to hundreds of tools. But there's a problem: most developers are building tools the same way they'd build APIs for other developers. That doesn't work.
Anthropic just published their internal playbook for building tools that agents actually use well. The core insight? Tools are a contract between deterministic systems and non-deterministic agents. When you ask Claude "Should I bring an umbrella today?", it might call a weather tool, answer from memory, ask for your location first, or hallucinate. Traditional software doesn't behave this way.
This changes everything about how you design tools. Instead of maximizing flexibility and coverage—the way you'd design a REST API—you need to optimize for agent affordances. Agents have limited context windows. They process information token-by-token. They can get confused by overlapping functionality. They need natural language identifiers, not UUIDs.
The surprising part? Anthropic's researchers found that Claude Code can write better tools than humans when given the right evaluation framework. Their internal Slack and Asana tools performed better after Claude optimized them—even beating "expert" implementations written by their own research team. This isn't about replacing developers. It's about collaborating with agents to build tools that other agents can use effectively.
How It Works
Anthropic's process has three phases: prototype, evaluate, optimize. Start by building a quick MCP server or Desktop extension. Use Claude Code to generate the initial implementation—feed it documentation from llms.txt files and relevant SDK docs. Connect it locally and test it yourself. Get user feedback. Build intuition around real use cases.
Then build an evaluation. This is the critical step most teams skip. Generate dozens of prompt-response pairs grounded in real-world complexity. Weak evaluation tasks look like "Schedule a meeting with jane@acme.corp next week." Strong tasks look like "Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Attach the notes from our last project planning meeting and reserve a conference room."
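One lightweight way to encode such tasks (the names and `verify` predicates here are illustrative, not Anthropic's format) is a record pairing each prompt with a programmatically checkable outcome:

```typescript
// Hypothetical sketch of an evaluation task record: each prompt is paired
// with a verifiable outcome rather than a free-form rubric.
interface EvalTask {
  prompt: string;
  // Predicate run against the agent's final response or end state.
  verify: (finalResponse: string) => boolean;
}

const tasks: EvalTask[] = [
  {
    // Weak: trivially satisfiable with a single tool call.
    prompt: "Schedule a meeting with jane@acme.corp next week.",
    verify: (r) => r.includes("jane@acme.corp"),
  },
  {
    // Strong: requires search, retrieval, and booking across several tools.
    prompt:
      "Schedule a meeting with Jane next week to discuss our latest " +
      "Acme Corp project. Attach the notes from our last project " +
      "planning meeting and reserve a conference room.",
    verify: (r) => r.includes("conference room") && r.includes("notes"),
  },
];
```

Grounding each task in a checkable outcome is what lets the later optimization loop run unattended.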
Run your evaluation programmatically with simple agentic loops—one while loop per task, alternating between LLM API calls and tool execution. Instruct agents to output reasoning and feedback blocks before tool calls. This triggers chain-of-thought behavior and helps you understand why agents choose certain tools. If you're using Claude, enabling interleaved thinking gives you this behavior automatically.
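A minimal sketch of one such loop, with the model and tool calls stubbed out (a real harness would call an LLM API and your MCP server in their place):

```typescript
// One while loop per task, alternating model calls and tool execution.
// callModel and runTool are stand-ins for the LLM API and MCP server.
type Turn = { role: string; content: string };

function callModel(history: Turn[]): { toolCall?: string; answer?: string } {
  // Stub: answers as soon as one tool result is in the history.
  return history.some((t) => t.role === "tool")
    ? { answer: "done" }
    : { toolCall: "search_contacts" };
}

function runTool(name: string): string {
  return `results from ${name}`; // stub tool execution
}

function runTask(prompt: string, maxTurns = 10): string {
  const history: Turn[] = [{ role: "user", content: prompt }];
  let turns = 0;
  while (turns++ < maxTurns) {
    const step = callModel(history);
    if (step.answer !== undefined) return step.answer; // task finished
    history.push({ role: "tool", content: runTool(step.toolCall!) });
  }
  return "max turns exceeded";
}
```

The full transcript accumulated in `history` is what you'll later feed back to Claude Code for analysis.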
Track more than just accuracy. Measure total runtime, number of tool calls, token consumption, and tool errors. Lots of redundant calls might mean you need better pagination. Lots of parameter errors might mean your tool descriptions are unclear. When Anthropic launched Claude's web search tool, they caught Claude needlessly appending "2025" to search queries—degrading results. They fixed it by improving the tool description.
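A simple per-task metrics record, with an aggregation step that surfaces the diagnostic signals described above (field names are my own, not a prescribed schema):

```typescript
// Hypothetical per-task metrics: track more than pass/fail.
interface TaskMetrics {
  passed: boolean;
  runtimeMs: number;
  toolCalls: number;
  inputTokens: number;
  outputTokens: number;
  toolErrors: number;
}

// High avgToolCalls can point at missing pagination; a high errorRate
// can point at unclear tool descriptions or parameter schemas.
function summarize(runs: TaskMetrics[]) {
  const n = runs.length;
  return {
    accuracy: runs.filter((r) => r.passed).length / n,
    avgToolCalls: runs.reduce((s, r) => s + r.toolCalls, 0) / n,
    errorRate: runs.reduce((s, r) => s + r.toolErrors, 0) / n,
  };
}
```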
Here's where it gets interesting: concatenate your evaluation transcripts and paste them into Claude Code. Claude analyzes the failures, identifies patterns, and refactors your tools. Anthropic used held-out test sets to verify this approach. Claude-optimized tools beat human-written implementations on internal Slack and Asana evaluations. The improvements came from subtle changes—better namespacing, clearer descriptions, more selective tool implementations.
The evaluation-driven loop is continuous. As you add tools or change implementations, re-run your evaluation. Let Claude analyze the new transcripts. Iterate. Anthropic's internal tools went through multiple rounds of this process, extracting performance gains each time.
What This Changes For Developers
Stop wrapping every API endpoint in a tool. That's the first lesson. More tools don't mean better outcomes. Agents have different affordances than traditional software. Computer memory is cheap; agent context is expensive. If you build a list_contacts tool that returns all contacts, the agent has to read through each one token-by-token. That's brute-force search. Humans don't work that way. Neither should your tools.
Build search_contacts or message_contact instead. Consolidate functionality. A schedule_event tool that finds availability and books the room is better than separate list_users, list_events, and create_event tools. A get_customer_context tool that compiles recent transactions, notes, and metadata is better than three separate retrieval tools.
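A sketch of what that consolidation looks like (the helper functions are hypothetical stubs): one `schedule_event` call does internally what would otherwise cost the agent three round-trips through `list_users`, `list_events`, and `create_event`:

```typescript
// Consolidated tool: availability search, room booking, and event
// creation happen server-side in a single agent-facing call.
function schedule_event(attendees: string[], topic: string): string {
  const slot = findFirstSharedSlot(attendees);      // availability search
  const room = reserveRoom(slot, attendees.length); // room booking
  createCalendarEvent(slot, room, attendees, topic);
  return `Booked "${topic}" at ${slot} in ${room}.`;
}

// Stubs standing in for real calendar and room-inventory backends.
function findFirstSharedSlot(attendees: string[]): string {
  return "2025-06-03T10:00";
}
function reserveRoom(slot: string, size: number): string {
  return "Room 4B";
}
function createCalendarEvent(
  slot: string, room: string, attendees: string[], topic: string
): void {
  // would write to the calendar backend
}
```

The agent sees one tool with one clear purpose; the orchestration cost stays in deterministic code where it belongs.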
This mirrors how context engineering works—give agents less information, but make it higher signal. Tools should return meaningful context, not technical flexibility. Skip the UUIDs, MIME types, and pixel dimensions. Return name, file_type, and image_url instead. Anthropic found that replacing alphanumeric UUIDs with natural language identifiers or even 0-indexed IDs significantly reduced hallucinations in retrieval tasks.
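One way to apply the identifier advice (a sketch under my own naming, not Anthropic's implementation): expose short 0-indexed handles to the agent and keep the UUID mapping server-side, resolving it back on later tool calls:

```typescript
// Server-side lookup table: the agent never sees raw UUIDs.
const uuidByIndex: string[] = [];

function toAgentView(doc: { uuid: string; name: string; mime: string }) {
  const id = uuidByIndex.push(doc.uuid) - 1; // 0-indexed handle
  // Return high-signal fields only: no UUIDs, no raw MIME types.
  return { id, name: doc.name, file_type: doc.mime.split("/")[1] };
}

function resolveId(id: number): string {
  return uuidByIndex[id]; // translate back to the real UUID internally
}
```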
Namespacing matters more than you'd think. When agents have access to dozens of MCP servers and hundreds of tools, overlapping functionality causes confusion. Prefix-based namespacing like asana_search and jira_search helps agents select the right tool. So does resource-based namespacing like asana_projects_search and asana_users_search. Anthropic found non-trivial performance differences between prefix and suffix schemes—test both in your evaluations.
Token efficiency is non-negotiable. Implement pagination, range selection, filtering, or truncation for any tool that could return large responses. Claude Code restricts tool responses to 25,000 tokens by default. If you truncate, tell the agent what happened and how to refine the query. Error messages should be actionable, not opaque. Instead of returning a stack trace, return "The 'user_id' parameter must be a valid UUID. Example: 'a1b2c3d4-e5f6-7890-abcd-ef1234567890'."
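A sketch of truncation that tells the agent what happened and how to recover (the limit matches Claude Code's default; the recovery wording and `page` parameter are illustrative):

```typescript
const MAX_RESPONSE_TOKENS = 25_000; // Claude Code's default response cap

// Truncate oversized responses, and say so: an actionable notice beats
// a silently cut-off payload the agent can't reason about.
function truncateResponse(text: string, tokenCount: number): string {
  if (tokenCount <= MAX_RESPONSE_TOKENS) return text;
  const keepRatio = MAX_RESPONSE_TOKENS / tokenCount;
  const kept = text.slice(0, Math.floor(text.length * keepRatio));
  return (
    kept +
    "\n[Truncated: response exceeded " + MAX_RESPONSE_TOKENS +
    " tokens. Narrow the query with a filter, or request the next " +
    "page with the 'page' parameter.]"
  );
}
```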
Consider adding a response_format enum parameter to your tools. Let agents choose between "concise" and "detailed" responses. Detailed responses include technical IDs needed for downstream tool calls. Concise responses strip those out and return only high-signal information. Anthropic's Slack tools used ⅓ the tokens with concise responses while maintaining functionality for most tasks.
Try It Yourself
Here's a practical example of a response_format enum that controls tool verbosity:
enum ResponseFormat {
  DETAILED = "detailed",
  CONCISE = "concise"
}

interface SearchContactsParams {
  query: string;
  response_format?: ResponseFormat;
}

function search_contacts(params: SearchContactsParams) {
  // performSearch stands in for the underlying contact-search backend.
  const results = performSearch(params.query);
  if (params.response_format === ResponseFormat.DETAILED) {
    // Detailed: include technical IDs needed for downstream tool calls.
    return results.map(contact => ({
      id: contact.uuid,
      name: contact.name,
      email: contact.email,
      phone: contact.phone,
      created_at: contact.created_at,
      last_modified: contact.last_modified
    }));
  }
  // Default to concise: only high-signal fields, no technical IDs.
  return results.map(contact => ({
    name: contact.name,
    email: contact.email
  }));
}

To connect a local MCP server to Claude Code for testing:

claude mcp add my-tools node /path/to/server.js

For the Claude Desktop app, navigate to Settings > Developer to add MCP servers or Settings > Extensions for Desktop extensions.
Check out Anthropic's tool evaluation cookbook for a complete end-to-end implementation of the evaluation process described in this article.
The Bottom Line
Use this approach if you're building MCP servers or tools for any AI agent system. The evaluation-driven workflow applies whether you're using Claude, GPT-4, or another model. Skip it if you're building one-off tools for a single task—the overhead isn't worth it.
The real opportunity here is the collaboration model. Letting Claude Code optimize your tools against real evaluations isn't just faster than manual iteration—it's often better. Anthropic's held-out test results prove this. But you need the evaluation infrastructure first. Without it, you're flying blind.
The risk is over-engineering. Don't build tools for every API endpoint. Don't add flexibility agents won't use. Start with 3-5 high-impact tools that match real workflows. Run your evaluation. Let Claude analyze the transcripts and suggest improvements. Iterate from there. The agents that use your tools will thank you—by actually using them correctly.
Source: Anthropic