Anthropic Just Fixed the Three Biggest Problems With AI Tool Use

Anthropic shipped Tool Search Tool, Programmatic Tool Calling, and Tool Use Examples — three features that cut token overhead by 85%, eliminate inference passes on intermediate results, and teach correct API usage through examples instead of schemas.

TL;DR

  • Tool Search Tool cuts token overhead by 85% — load tools on-demand instead of stuffing 50+ definitions upfront
  • Programmatic Tool Calling lets Claude orchestrate tools through Python code, eliminating inference overhead and keeping intermediate results out of context
  • Tool Use Examples teach correct API usage through concrete samples, not just JSON schemas
  • These features are production-ready and already powering Claude for Excel's spreadsheet manipulation

The Big Picture

AI agents hit a wall when they need to work with dozens or hundreds of tools. Load all your MCP servers upfront and you're burning 50K+ tokens before Claude reads a single request. Call tools one at a time through natural language and you're paying for full inference passes on every intermediate result. Rely on JSON schemas alone and Claude guesses at parameter formats, breaking your API calls.

Anthropic just shipped three features that solve these bottlenecks: Tool Search Tool for on-demand discovery, Programmatic Tool Calling for code-based orchestration, and Tool Use Examples for teaching correct usage patterns. These aren't incremental improvements — they're architectural changes that make previously impossible workflows practical.

The proof is already shipping. Claude for Excel uses Programmatic Tool Calling to manipulate spreadsheets with thousands of rows without overloading context. Internal testing shows Tool Search Tool improving accuracy from 49% to 74% on Opus 4 when working with large tool libraries. Tool Use Examples pushed complex parameter handling from 72% to 90% accuracy.

This matters because building effective agents means handling real-world scale: IDE assistants that integrate git, package managers, testing frameworks, and deployment pipelines simultaneously. Operations coordinators that connect Slack, GitHub, Google Drive, Jira, and company databases. These systems need hundreds of tools available without paying the token cost upfront.

How It Works

Tool Search Tool: Stop Loading Everything Upfront

The traditional approach loads all tool definitions into Claude's context at the start of every conversation. Connect five MCP servers and you're looking at 55K tokens consumed before any work begins. GitHub alone costs 26K tokens for 35 tools. Add Slack (21K), Jira (17K), and a few others and you're approaching 100K+ tokens of overhead. Anthropic has seen tool definitions hit 134K tokens before optimization.

Tool Search Tool flips this model. You mark tools with defer_loading: true in your API call. Claude only sees the Tool Search Tool itself plus any critical tools you explicitly keep loaded. When Claude needs specific capabilities, it searches for them. The search returns references to matching tools, which then get expanded into full definitions.

The token savings are dramatic. A 50+ tool MCP setup that previously consumed 77K tokens upfront now uses 8.7K, an 85% reduction in tool-definition overhead that leaves roughly 95% of the context window free for actual work, all while maintaining access to your full tool library.

The implementation is straightforward. Include a search tool (regex or BM25-based, both provided out of the box), then mark tools for deferred loading:

{
  "tools": [
    {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
    {
      "name": "github.createPullRequest",
      "description": "Create a pull request",
      "input_schema": {...},
      "defer_loading": true
    }
  ]
}

For entire MCP servers, defer the whole server while keeping high-use tools loaded:

{
  "type": "mcp_toolset",
  "mcp_server_name": "google-drive",
  "default_config": {"defer_loading": true},
  "configs": {
    "search_files": {"defer_loading": false}
  }
}

Tool Search Tool doesn't break prompt caching because deferred tools are excluded from the initial prompt entirely. They're only added after Claude searches for them, so your system prompt and core tool definitions remain cacheable.
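In practice that means your cacheable prefix can still be marked explicitly. A minimal sketch, assuming the standard cache_control prompt-caching marker applies here as it does elsewhere in the Messages API (the search_files tool and its schema are illustrative placeholders):

```python
# Hypothetical request fragment: always-loaded tools form a stable,
# cacheable prefix; deferred tools never appear in the initial prompt.
tools = [
    {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
    {
        "name": "search_files",  # high-use tool, kept loaded
        "description": "Search files by name or content",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
        # Prompt-caching marker: caches everything up to this tool definition
        "cache_control": {"type": "ephemeral"},
    },
    {
        "name": "github.createPullRequest",  # rarely used, loaded on demand
        "description": "Create a pull request",
        "input_schema": {"type": "object", "properties": {"title": {"type": "string"}}},
        "defer_loading": True,
    },
]
```

Because the deferred entry is excluded from the prompt, Claude searching for it later extends the conversation rather than invalidating the cached prefix.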

The accuracy improvements are significant. Opus 4 jumped from 49% to 74% on MCP evaluations with large tool libraries. Opus 4.5 improved from 79.5% to 88.1%. The biggest gains come from reducing tool selection errors — when you have tools named notification-send-user and notification-send-channel, loading only relevant tools eliminates confusion.

Programmatic Tool Calling: Orchestrate Through Code, Not Inference

Traditional tool calling creates two problems at scale. First, every tool result enters Claude's context whether it's useful or not. Analyze a 10MB log file and the entire file pollutes your context window, even though you only need error frequency counts. Second, each tool call requires a full inference pass. A five-tool workflow means five inference cycles plus Claude manually parsing each result, comparing values, and synthesizing conclusions through natural language.

Programmatic Tool Calling lets Claude write Python code that orchestrates tools directly. Instead of requesting tools one at a time with results returning to context, Claude generates a script that calls multiple tools, processes outputs, and controls what information actually reaches its context window.

Consider a budget compliance check: "Which team members exceeded their Q3 travel budget?" You have tools for fetching team members, expenses, and budget limits. The traditional approach fetches 20 team members, makes 20 expense calls returning 50-100 line items each, fetches budget limits, then dumps 2,000+ expense line items (50KB+) into Claude's context for manual summation and comparison.

With Programmatic Tool Calling, Claude writes orchestration code that runs in a sandboxed environment:

# Runs inside the code-execution sandbox, where each opted-in tool
# is exposed as an async function
import asyncio
import json

team = await get_team_members("engineering")

# Fetch budgets for each unique level
levels = list(set(m["level"] for m in team))
budget_results = await asyncio.gather(*[
    get_budget_by_level(level) for level in levels])
budgets = {level: budget for level, budget in zip(levels, budget_results)}

# Fetch all expenses in parallel
expenses = await asyncio.gather(*[
    get_expenses(m["id"], "Q3") for m in team])

# Find employees who exceeded their travel budget
exceeded = []
for member, exp in zip(team, expenses):
    budget = budgets[member["level"]]
    total = sum(e["amount"] for e in exp)
    if total > budget["travel_limit"]:
        exceeded.append({
            "name": member["name"],
            "spent": total,
            "limit": budget["travel_limit"]
        })

print(json.dumps(exceeded))

Claude's context receives only the final result: the two or three people who exceeded their budget. The 2,000+ line items, intermediate sums, and budget lookups never touch Claude's context. Token consumption drops from 200KB of raw expense data to 1KB of results.

The efficiency gains compound. Token usage dropped 37% on complex research tasks (43,588 to 27,297 tokens average). Latency improvements are substantial because you eliminate 19+ inference passes when Claude orchestrates 20+ tool calls in a single code block. Accuracy improved on internal knowledge retrieval (25.6% to 28.5%) and GIA benchmarks (46.5% to 51.2%).

To enable it, add code execution to your tools and opt-in specific tools with allowed_callers:

{
  "tools": [
    {
      "type": "code_execution_20250825",
      "name": "code_execution"
    },
    {
      "name": "get_team_members",
      "description": "Get all members of a department...",
      "input_schema": {...},
      "allowed_callers": ["code_execution_20250825"]
    }
  ]
}

When the code calls a tool, you receive a request with a caller field indicating it's being invoked from code execution. You return the result, which gets processed in the sandbox rather than Claude's context. This request-response cycle repeats for each tool call. When the code finishes, only the final output enters Claude's context.
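Client-side, answering those requests can be as simple as dispatching on the caller field. A rough sketch under assumed shapes (the TOOL_IMPL registry and the block dictionaries are simplified stand-ins for the real API objects):

```python
import json

# Hypothetical local implementations of the opted-in tools (stand-ins)
TOOL_IMPL = {
    "get_team_members": lambda args: [{"id": "u1", "name": "Jane", "level": "L5"}],
}

def handle_tool_use(block):
    """Execute one tool request and note where its result will be processed."""
    # Calls originating inside the sandbox carry a caller field; their
    # results are consumed by the running code, not by Claude's context.
    destination = ("sandbox" if block.get("caller") == "code_execution_20250825"
                   else "context")
    result = TOOL_IMPL[block["name"]](block.get("input", {}))
    tool_result = {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": json.dumps(result),
    }
    return tool_result, destination

reply, destination = handle_tool_use({
    "id": "toolu_01",
    "name": "get_team_members",
    "input": {"department": "engineering"},
    "caller": "code_execution_20250825",
})
# destination == "sandbox": the raw team data feeds the running code,
# never Claude's context
```

The same handler serves both paths; only the destination of the result changes.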

Tool Use Examples: Teach Usage Patterns, Not Just Schemas

JSON schemas define structure — types, required fields, allowed enums — but they can't express usage patterns. When should you include optional parameters? Which combinations make sense? What conventions does your API expect?

A support ticket API might have a schema with title, priority, labels, nested reporter objects with contact info, due_date, and escalation settings. The schema defines what's valid, but leaves critical questions unanswered: Should due_date use "2024-11-06" or "Nov 6, 2024"? Is reporter.id a UUID or "USR-12345"? When should Claude populate nested contact info? How do escalation levels correlate with priority?

Tool Use Examples let you provide sample tool calls directly in your tool definitions:

{
  "name": "create_ticket",
  "input_schema": { /* schema */ },
  "input_examples": [
    {
      "title": "Login page returns 500 error",
      "priority": "critical",
      "labels": ["bug", "authentication", "production"],
      "reporter": {
        "id": "USR-12345",
        "name": "Jane Smith",
        "contact": {
          "email": "jane@acme.com",
          "phone": "+1-555-0123"
        }
      },
      "due_date": "2024-11-06",
      "escalation": {
        "level": 2,
        "notify_manager": true,
        "sla_hours": 4
      }
    },
    {
      "title": "Add dark mode support",
      "labels": ["feature-request", "ui"],
      "reporter": {
        "id": "USR-67890",
        "name": "Alex Chen"
      }
    },
    {
      "title": "Update API documentation"
    }
  ]
}

From these three examples, Claude learns format conventions (YYYY-MM-DD dates, USR-XXXXX IDs, kebab-case labels), nested structure patterns (how to construct reporter objects with contact info), and optional parameter correlations (critical bugs get full contact + escalation with tight SLAs, feature requests get reporter but no contact, internal tasks are title-only).

Internal testing showed accuracy improvements from 72% to 90% on complex parameter handling. The examples add tokens to your tool definitions, so use them strategically: complex nested structures, tools with many optional parameters where inclusion patterns matter, APIs with domain-specific conventions not captured in schemas.

What This Changes For Developers

These features unlock workflows that weren't practical before. IDE assistants can now integrate dozens of tools — git operations, file manipulation, package managers, testing frameworks, deployment pipelines — without burning context on unused definitions. Operations coordinators can connect Slack, GitHub, Google Drive, Jira, and company databases simultaneously, discovering tools on-demand as tasks require them.

The architectural shift is significant. Instead of treating tool use as simple function calling, you're building systems with intelligent orchestration. Dynamic discovery means your agent scales to hundreds of tools without linear token costs. Code-based execution means complex workflows run efficiently without inference overhead on every step. Concrete examples mean your APIs get called correctly the first time.

The features are complementary but you don't need all three for every task. Start with your biggest bottleneck. Context bloat from tool definitions? Tool Search Tool. Large intermediate results polluting context? Programmatic Tool Calling. Parameter errors and malformed calls? Tool Use Examples. Then layer additional features as needed.

For Tool Search Tool, write clear, descriptive tool definitions since search matches against names and descriptions. Keep your three to five most-used tools always loaded, defer the rest. Add system prompt guidance so Claude knows what's available: "You have access to tools for Slack messaging, Google Drive file management, Jira ticket tracking, and GitHub repository operations. Use the tool search to find specific capabilities."

For Programmatic Tool Calling, document return formats clearly since Claude writes code to parse tool outputs. Opt-in tools that benefit from programmatic orchestration: operations that can run in parallel, idempotent operations safe to retry, tools that return large datasets where you only need aggregates.
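One way to document a return format is to spell it out in the description itself, so Claude's generated code parses the right shape. A hypothetical definition for the get_expenses tool from the budget example (the exact schema and return shape are illustrative):

```python
# Hypothetical tool definition: the description states the return format
# so orchestration code can be written against it directly.
get_expenses_tool = {
    "name": "get_expenses",
    "description": (
        "Get expenses for an employee in a given quarter. "
        'Returns a JSON array like: [{"amount": 125.50, '
        '"category": "travel", "date": "2024-08-14"}, ...]'
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "employee_id": {"type": "string"},
            "quarter": {"type": "string", "enum": ["Q1", "Q2", "Q3", "Q4"]},
        },
        "required": ["employee_id", "quarter"],
    },
    # Opt in to programmatic orchestration: reads are idempotent and parallel-safe
    "allowed_callers": ["code_execution_20250825"],
}
```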

For Tool Use Examples, craft examples for behavioral clarity. Use realistic data (real city names, plausible prices, not "string" or "value"). Show variety with minimal, partial, and full specification patterns. Keep it concise — one to five examples per tool. Focus on ambiguity and only add examples where correct usage isn't obvious from schema.

Try It Yourself

These features are available in beta. Enable them with the beta header and include the tools you need:

import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    betas=["advanced-tool-use-2025-11-20"],
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    tools=[
        {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
        {"type": "code_execution_20250825", "name": "code_execution"},
        # Your tools with defer_loading, allowed_callers, and input_examples
    ]
)

Anthropic provides detailed documentation and cookbooks for each feature in its developer docs.

The cookbooks include runnable examples for common patterns: embedding-based tool search, parallel tool execution with async/await, and example-driven parameter handling.

The Bottom Line

Use Tool Search Tool if you're connecting 10+ tools or multiple MCP servers and hitting context limits. The 85% token reduction and accuracy improvements justify the added search step. Skip it if you have fewer than 10 tools or all tools are used frequently in every session.

Use Programmatic Tool Calling if you're processing large datasets where you only need aggregates, running multi-step workflows with three or more dependent tool calls, or handling tasks where intermediate data shouldn't influence Claude's reasoning. Skip it for simple single-tool invocations or tasks where Claude should see all intermediate results.

Use Tool Use Examples if you have complex nested structures, many optional parameters where inclusion patterns matter, or APIs with domain-specific conventions not captured in schemas. Skip them for simple single-parameter tools with obvious usage.

The real opportunity is combining these features strategically. An MCP-powered IDE assistant might use Tool Search Tool to discover git and deployment tools on-demand, Programmatic Tool Calling to orchestrate multi-step test runs without polluting context, and Tool Use Examples to ensure correct API parameter formatting. That combination wasn't practical before — now it's production-ready.
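A single tool entry can opt into all three features at once. A hedged sketch (the jira.createIssue tool, its schema, and the example values are illustrative, not from Anthropic's docs):

```python
# Hypothetical tool combining deferred loading, programmatic calling,
# and usage examples in one definition
combined_tool = {
    "name": "jira.createIssue",
    "description": "Create a Jira issue",
    "input_schema": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "priority": {"type": "string"},
        },
        "required": ["summary"],
    },
    "defer_loading": True,                           # discovered via Tool Search Tool
    "allowed_callers": ["code_execution_20250825"],  # callable from orchestration code
    "input_examples": [                              # teaches conventions by example
        {"summary": "Checkout flow times out", "priority": "High"},
        {"summary": "Update onboarding docs"},
    ],
}
```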

Source: Anthropic