Claude Can Now Search 1,000+ Tools Without Blowing Its Context
Anthropic shipped Tool Search Tool, Programmatic Tool Calling, and Tool Use Examples. Tool Search cuts token usage by 85% through on-demand tool discovery. Programmatic Tool Calling orchestrates workflows in code, cutting per-call inference overhead. Tool Use Examples teach correct usage patterns that schemas can't express.
TL;DR
- Tool Search Tool lets Claude discover tools on-demand instead of loading all definitions upfront — 85% token reduction, 95% context preservation
- Programmatic Tool Calling moves orchestration into code execution, eliminating inference overhead and keeping intermediate results out of context — 37% token savings on complex workflows
- Tool Use Examples teach Claude correct parameter usage through concrete samples, improving accuracy from 72% to 90% on complex APIs
- If you're building MCP-powered agents or multi-tool workflows, these features solve the context bloat and accuracy problems that kill production agents
The Big Picture
Production AI agents die from context bloat. You connect five MCP servers — GitHub, Slack, Sentry, Grafana, Splunk — and suddenly 55,000 tokens vanish before Claude reads your first request. Add Jira and you're at 72,000 tokens of tool definitions alone. The model hasn't done any work yet.
Anthropic just shipped three features that fundamentally change how Claude handles tools. Tool Search Tool discovers capabilities on-demand instead of front-loading every definition. Programmatic Tool Calling orchestrates multi-step workflows in code rather than burning an inference pass per tool invocation. Tool Use Examples show Claude correct usage patterns that JSON schemas can't express.
These aren't incremental improvements. Internal testing showed Opus 4 jumping from 49% to 74% accuracy on MCP evaluations with Tool Search enabled. Programmatic Tool Calling cut token usage by 37% on research tasks while improving knowledge retrieval from 25.6% to 28.5%. Tool Use Examples pushed complex parameter handling from 72% to 90% accuracy.
The real unlock is scale. An IDE assistant that coordinates git, file operations, package managers, test frameworks, and deployment pipelines. An operations coordinator connecting Slack, GitHub, Google Drive, Jira, company databases, and dozens of MCP servers simultaneously. These workflows were impossible when every tool definition consumed context upfront and every invocation required a full inference pass.
How Tool Search Tool Works
Traditional tool use loads all definitions into Claude's context immediately. Five MCP servers with 58 tools consume 55,000 tokens before the conversation starts. At scale, Anthropic saw tool definitions hitting 134,000 tokens.
Tool Search Tool flips this model. You provide all tool definitions to the API but mark them with defer_loading: true. Deferred tools don't enter Claude's context initially. Claude only sees the Tool Search Tool itself plus any critical tools you explicitly keep loaded.
When Claude needs specific capabilities, it searches for relevant tools. The search returns references to matching tools, which then get expanded into full definitions. If Claude needs GitHub operations, only github.createPullRequest and github.listIssues load — not your 50+ tools from Slack, Jira, and Google Drive.
The token savings are dramatic. A 50+ tool MCP setup that previously consumed 77,000 tokens upfront now uses 8,700 tokens — preserving 95% of the context window. That's an 85% reduction in token usage while maintaining access to your full tool library.
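In practice, deferral is just a flag on each tool definition. Here's a minimal sketch of what assembling such a tool list might look like — the tool names and schemas are invented for illustration; only the `defer_loading` field and the regex search tool entry come from the feature description:

```python
# Sketch: building a tool list where most definitions are deferred.
# Tool names and schemas are hypothetical placeholders.

def make_tool(name, description, deferred=True):
    """Build a minimal tool definition, deferred by default."""
    return {
        "name": name,
        "description": description,
        "input_schema": {"type": "object", "properties": {}},
        "defer_loading": deferred,
    }

tools = [
    # The search tool itself always stays in context.
    {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
    # Keep one high-traffic tool loaded...
    make_tool("github_create_pull_request", "Open a pull request", deferred=False),
    # ...and defer the long tail.
    make_tool("jira_create_issue", "Create a Jira issue"),
    make_tool("slack_post_message", "Post a message to a Slack channel"),
]

deferred = [t for t in tools if t.get("defer_loading")]
print(f"{len(deferred)} of {len(tools)} tools deferred")
```

Only the first two entries cost context tokens up front; the deferred tools surface when a search matches their name or description.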
Anthropic provides regex-based and BM25-based search tools out of the box. You can also implement custom search using embeddings or other strategies. For MCP servers, you can defer entire servers while keeping specific high-use tools loaded:
```json
{
  "type": "mcp_toolset",
  "mcp_server_name": "google-drive",
  "default_config": {"defer_loading": true},
  "configs": {
    "search_files": {"defer_loading": false}
  }
}
```

The feature adds a search step before tool invocation, so it delivers ROI when context savings and accuracy improvements outweigh the latency cost. Use it when tool definitions exceed 10,000 tokens, when you're experiencing tool selection errors, or when building MCP-powered systems with multiple servers. Skip it for small tool libraries under 10 tools where everything gets used frequently.
One critical detail: Tool Search Tool doesn't break prompt caching. Deferred tools are excluded from the initial prompt entirely, so your system prompt and core tool definitions remain cacheable. Tools only enter context after Claude searches for them.
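That means you can still place a prompt-cache breakpoint after your always-loaded tools. A hypothetical sketch — the tool names are invented, and `cache_control` is the standard prompt-caching marker, assumed here to combine with deferral as described:

```json
[
  {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
  {
    "name": "search_files",
    "description": "Search files in Google Drive",
    "input_schema": {"type": "object", "properties": {}},
    "cache_control": {"type": "ephemeral"}
  },
  {
    "name": "jira_create_issue",
    "description": "Create a Jira issue",
    "input_schema": {"type": "object", "properties": {}},
    "defer_loading": true
  }
]
```

Everything up to and including the `cache_control` breakpoint forms a stable, cacheable prefix; the deferred definitions that load later don't invalidate it.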
How Programmatic Tool Calling Works
Traditional tool calling creates two bottlenecks. First, every tool result enters Claude's context whether it's useful or not. Analyzing a 10MB log file? The entire file lands in context even though Claude only needs error frequency summaries. Fetching customer data across multiple tables? Every record accumulates in context regardless of relevance.
Second, each tool call requires a full inference pass. Claude requests a tool, waits for the result, reasons over that result in natural language, decides what to do next, then requests another tool. A five-tool workflow means five inference passes, plus Claude manually synthesizing each result.
Programmatic Tool Calling moves orchestration into code execution. Instead of requesting tools one at a time through the API, Claude writes Python code that calls multiple tools, processes their outputs, and controls what information actually enters its context window.
Consider a budget compliance check: "Which team members exceeded their Q3 travel budget?" You have tools for fetching team members, expenses, and budget limits. Traditional approach fetches 20 team members, makes 20 expense calls returning 2,000+ line items, fetches budget limits, dumps everything into Claude's context (50KB+), then has Claude manually sum expenses and compare against budgets.
With Programmatic Tool Calling, Claude writes a script that orchestrates the entire workflow. The script runs in a sandboxed Code Execution environment, pausing when it needs results from your tools. When you return tool results via the API, they're processed by the script rather than consumed by the model. Claude only sees the final output — the two or three people who exceeded their budget. The 2,000+ line items never touch Claude's context.
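The script Claude generates might look roughly like this. The three tool functions are stand-ins for the Python wrappers the API would generate from your tool definitions, and the sample data is invented — this is a sketch of the shape, not the real sandbox code:

```python
# Sketch of the orchestration code Claude might write for the budget check.
# get_team_members, get_expenses, and get_budget_limit stand in for the
# generated tool wrappers; the numbers are made up for illustration.

def get_team_members(department):
    return [{"id": "U1", "name": "Jane"}, {"id": "U2", "name": "Alex"},
            {"id": "U3", "name": "Sam"}]

def get_expenses(user_id, quarter):
    line_items = {
        "U1": [1200, 800, 3100],
        "U2": [400, 250],
        "U3": [2900, 2500],
    }
    return line_items[user_id]

def get_budget_limit(user_id):
    return 5000  # flat Q3 travel budget for this sketch

# This loop is what never touches Claude's context: every line item
# stays inside the sandbox, and only the summary below escapes it.
over_budget = []
for member in get_team_members("engineering"):
    total = sum(get_expenses(member["id"], "Q3"))
    if total > get_budget_limit(member["id"]):
        over_budget.append({"name": member["name"], "total": total})

print(over_budget)  # only this summary enters Claude's context
```

The point of the pattern: the aggregation happens in code, so Claude reasons over two entries instead of 2,000 line items.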
The efficiency gains compound. Token consumption dropped from 43,588 to 27,297 on complex research tasks — a 37% reduction. Latency improved because you eliminate 19+ inference passes when Claude orchestrates 20+ tool calls in a single code block. Accuracy improved because explicit orchestration logic makes fewer errors than juggling multiple tool results in natural language. Knowledge retrieval jumped from 25.6% to 28.5%, and GIA benchmarks improved from 46.5% to 51.2%.
Implementation requires opting tools into programmatic execution with allowed_callers:
```json
{
  "tools": [
    {
      "type": "code_execution_20250825",
      "name": "code_execution"
    },
    {
      "name": "get_team_members",
      "description": "Get all members of a department",
      "input_schema": {...},
      "allowed_callers": ["code_execution_20250825"]
    }
  ]
}
```

The API converts these tool definitions into Python functions Claude can call directly. When the code calls get_expenses(), you receive a tool request with a caller field indicating it came from code execution. You provide the result, which gets processed in the Code Execution environment rather than Claude's context.
Programmatic Tool Calling adds a code execution step, so it pays off when token savings and latency improvements are substantial. Use it when processing large datasets where you only need aggregates, running multi-step workflows with three or more dependent tool calls, or handling tasks where intermediate data shouldn't influence Claude's reasoning. Skip it for simple single-tool invocations or quick lookups with small responses.
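Client-side, you handle the request the same way either way; the caller field just tells you where the result will be consumed. A rough sketch — the block shape is simplified and `run_tool` is a hypothetical local dispatcher, not part of the SDK:

```python
# Rough sketch of dispatching a tool request based on its caller field.
# The request dict is a simplified stand-in for a real tool_use block.

def run_tool(name, args):
    # Stand-in for your real tool implementations.
    handlers = {"get_expenses": lambda user_id: [1200, 800]}
    return handlers[name](**args)

def handle_tool_use(block):
    result = run_tool(block["name"], block["input"])
    # A caller of code_execution means the result feeds the sandboxed
    # script instead of being read directly by the model.
    destination = ("code execution sandbox"
                   if block.get("caller") == "code_execution_20250825"
                   else "model context")
    return {"tool_use_id": block["id"], "content": result,
            "destination": destination}

reply = handle_tool_use({
    "id": "toolu_01",
    "name": "get_expenses",
    "input": {"user_id": "U1"},
    "caller": "code_execution_20250825",
})
print(reply["destination"])
```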
Anthropic's Claude for Excel uses Programmatic Tool Calling to read and modify spreadsheets with thousands of rows without overloading context. That's the kind of workflow these features enable.
How Tool Use Examples Work
JSON Schema defines structure — types, required fields, allowed enums — but it can't express usage patterns. When should you include optional parameters? Which combinations make sense? What conventions does your API expect?
Consider a support ticket API with fields for title, priority, labels, reporter (with nested contact info), due date, and escalation settings. The schema defines what's valid, but leaves critical questions unanswered. Should due_date use "2024-11-06", "Nov 6, 2024", or ISO 8601? Is reporter.id a UUID or "USR-12345"? When should Claude populate nested reporter.contact? How do escalation.level and escalation.sla_hours relate to priority?
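For concreteness, here's a minimal sketch of what such a schema might declare — the field names follow the article's example, but the exact schema is hypothetical. Notice that nothing in it says which date format or ID convention to use:

```json
{
  "name": "create_ticket",
  "input_schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
      "labels": {"type": "array", "items": {"type": "string"}},
      "reporter": {
        "type": "object",
        "properties": {
          "id": {"type": "string"},
          "name": {"type": "string"},
          "contact": {"type": "object"}
        }
      },
      "due_date": {"type": "string"},
      "escalation": {"type": "object"}
    },
    "required": ["title"]
  }
}
```

`due_date` is just a string here; whether it should be `2024-11-06` or `Nov 6, 2024` is exactly the kind of convention the schema cannot express.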
Tool Use Examples let you provide sample tool calls directly in your tool definitions. Instead of relying on schema alone, you show Claude concrete usage patterns:
```json
{
  "name": "create_ticket",
  "input_schema": {...},
  "input_examples": [
    {
      "title": "Login page returns 500 error",
      "priority": "critical",
      "labels": ["bug", "authentication", "production"],
      "reporter": {
        "id": "USR-12345",
        "name": "Jane Smith",
        "contact": {
          "email": "jane@acme.com",
          "phone": "+1-555-0123"
        }
      },
      "due_date": "2024-11-06",
      "escalation": {
        "level": 2,
        "notify_manager": true,
        "sla_hours": 4
      }
    },
    {
      "title": "Add dark mode support",
      "labels": ["feature-request", "ui"],
      "reporter": {
        "id": "USR-67890",
        "name": "Alex Chen"
      }
    },
    {
      "title": "Update API documentation"
    }
  ]
}
```

From these three examples, Claude learns format conventions (YYYY-MM-DD dates, USR-XXXXX IDs, kebab-case labels), nested structure patterns (how to construct reporter with nested contact), and optional parameter correlations (critical bugs get full contact info plus escalation with tight SLAs, feature requests get reporter but no contact, internal tasks get title only).
Internal testing showed accuracy improving from 72% to 90% on complex parameter handling. Tool Use Examples add tokens to your definitions, so they're most valuable when accuracy improvements outweigh the cost. Use them for complex nested structures, tools with many optional parameters where inclusion patterns matter, APIs with domain-specific conventions not captured in schemas, or similar tools where examples clarify which one to use. Skip them for simple single-parameter tools or standard formats Claude already understands.
What This Changes For Developers
These features solve the fundamental constraints that kill production agents. Context bloat from tool definitions. Inference overhead from sequential tool calls. Parameter errors from ambiguous schemas. Each feature targets a specific bottleneck.
The architectural shift is significant. Instead of front-loading every capability and burning inference passes for orchestration, agents now discover tools on-demand and execute workflows programmatically. This enables workflows that were previously impossible at scale.
An IDE assistant coordinating git operations, file manipulation, package managers, testing frameworks, and deployment pipelines can now access hundreds of tools without consuming 100,000+ tokens upfront. An operations coordinator connecting Slack, GitHub, Google Drive, Jira, company databases, and dozens of MCP servers can discover capabilities dynamically and orchestrate complex workflows without polluting context with intermediate results.
The features are complementary. Tool Search Tool ensures the right tools are found. Programmatic Tool Calling ensures efficient execution. Tool Use Examples ensure correct invocation. You don't need all three for every task — start with your biggest bottleneck and layer additional features as needed.
For MCP-powered systems, the impact is immediate. Multiple servers with dozens of tools each were previously impractical due to context consumption. Tool Search Tool makes this architecture viable. For workflows involving large datasets or multi-step operations, Programmatic Tool Calling eliminates the token bloat and latency that made these tasks unreliable.
Anthropic's work on building evals for AI agents showed that tool selection and parameter accuracy are critical failure modes. These features directly address both issues — dynamic discovery improves tool selection, and concrete examples improve parameter accuracy.
Try It Yourself
These features are available in beta. Enable them with the beta header and include the tools you need:
```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    betas=["advanced-tool-use-2025-11-20"],
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    tools=[
        {
            "type": "tool_search_tool_regex_20251119",
            "name": "tool_search_tool_regex"
        },
        {
            "type": "code_execution_20250825",
            "name": "code_execution"
        },
        # Your tools with defer_loading, allowed_callers, and input_examples
    ]
)
```

For detailed implementation, see Anthropic's Tool Search Tool documentation, Programmatic Tool Calling documentation, and Tool Use Examples documentation. The cookbooks provide working examples for both Tool Search and Programmatic Tool Calling.
Best practices for implementation:
- Keep your three to five most-used tools always loaded; defer the rest.
- Write clear, descriptive tool names and descriptions for better search discovery.
- Document return formats clearly so Claude writes correct parsing logic in programmatic workflows.
- Craft examples with realistic data showing minimal, partial, and full specification patterns.
The Bottom Line
Use Tool Search Tool if you're connecting multiple MCP servers or have 10+ tools consuming more than 10,000 tokens in definitions. The 85% token reduction and accuracy improvements justify the search latency. Skip it if you have fewer than 10 compact tools that all get used frequently.
Use Programmatic Tool Calling if you're processing large datasets where you only need aggregates, running multi-step workflows with three or more dependent tool calls, or handling tasks where intermediate data shouldn't influence reasoning. The 37% token savings and latency improvements compound quickly. Skip it for simple single-tool invocations or quick lookups.
Use Tool Use Examples if you have complex nested structures, many optional parameters where inclusion patterns matter, or APIs with domain-specific conventions. The jump from 72% to 90% accuracy on complex parameters is substantial. Skip them for simple tools or standard formats.
The real risk is continuing to build agents with the old model — front-loading all tool definitions and orchestrating through sequential inference passes. That architecture doesn't scale past a dozen tools or handle workflows involving large intermediate results. These features move tool use from simple function calling toward intelligent orchestration. If you're building production agents that need to coordinate dozens of tools and process real-world data, this is the architecture that makes it possible.
Source: Anthropic