Claude's "Think" Tool: When to Stop and Reason Mid-Task

Anthropic's "think" tool gives Claude structured reasoning checkpoints during execution. It delivered a 54% relative improvement on policy-heavy tasks and a 1.6% gain on SWE-Bench. Here's when to use it—and when to skip it.

TL;DR

  • Anthropic's "think" tool gives Claude a structured space to reason mid-task, after processing tool outputs
  • 54% relative improvement on complex policy-heavy tasks (τ-Bench airline domain), 1.6% boost on SWE-Bench
  • Different from extended thinking—this is for reasoning about new information during execution, not pre-planning
  • Best for multi-step tool chains and policy compliance; skip it for simple single-call scenarios

The Big Picture

Most AI coding assistants fail the same way: they commit too early. They see a problem, pick a solution, and charge ahead—only to realize three tool calls later that they misread the requirements or violated a constraint.

Anthropic's "think" tool addresses this by giving Claude a designated space to pause and reason after each step. It's not about thinking harder upfront. It's about creating checkpoints during execution where Claude can ask: "Do I have what I need? Does this comply with the rules? Should I backtrack?"

The results are striking. On τ-Bench's airline customer service domain—a benchmark designed to test policy adherence and multi-step reasoning—Claude 3.7 Sonnet with the "think" tool scored 0.570 versus 0.370 baseline. That's a 54% relative improvement. On SWE-Bench, it contributed to Claude's state-of-the-art 0.623 score with a 1.6% isolated gain.

This isn't a magic bullet. It's a tool for specific scenarios: long tool call chains, policy-heavy environments, sequential decisions where mistakes compound. For simple tasks, it adds overhead without benefit. The key is knowing when to use it.

How It Works

The "think" tool is deceptively simple. It's a standard tool definition that does nothing except log Claude's reasoning. No side effects. No database changes. Just a structured space for Claude to articulate its thought process mid-task.

Here's the basic implementation from τ-Bench:

{
  "name": "think",
  "description": "Use the tool to think about something. It will not obtain new information or change the database, but just append the thought to the log. Use it when complex reasoning or some cache memory is needed.",
  "input_schema": {
    "type": "object",
    "properties": {
      "thought": {
        "type": "string",
        "description": "A thought to think about."
      }
    },
    "required": ["thought"]
  }
}
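In an agent loop, handling a "think" call is trivial: append the thought to a log and return an empty acknowledgment. Here's a minimal sketch of such a dispatcher—the handler function and log structure are illustrative choices, not part of any Anthropic specification:

```python
# The tool definition from tau-Bench, as a Python dict
THINK_TOOL = {
    "name": "think",
    "description": "Use the tool to think about something. It will not obtain "
                   "new information or change the database, but just append the "
                   "thought to the log.",
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "A thought to think about."}
        },
        "required": ["thought"],
    },
}

thought_log = []  # side-effect-free record of Claude's reasoning

def handle_tool_call(name, tool_input):
    """Dispatch a tool call from the model. 'think' only logs; real tools act."""
    if name == "think":
        thought_log.append(tool_input["thought"])
        return ""  # nothing useful to report back; the thought itself is the point
    raise ValueError(f"unknown tool: {name}")

# Simulated model invocation of the tool:
result = handle_tool_call(
    "think",
    {"thought": "Policy allows refunds within 30 days; order is 45 days old. "
                "Check membership tier before proceeding."},
)
```

Returning an empty string keeps the tool inert: the value to the model comes entirely from having articulated the thought, not from the tool's response.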

The distinction from extended thinking matters. Extended thinking happens before Claude generates a response—it's deep pre-planning. The "think" tool activates during response generation, after Claude receives new information from tool outputs or user messages. Extended thinking is comprehensive upfront reasoning. The "think" tool is focused, incremental reasoning about what just happened.

Anthropic recommends extended thinking for straightforward scenarios: non-sequential tool calls, simple instruction following, coding and math problems that don't require external tools. The "think" tool shines when Claude needs to analyze tool outputs carefully, navigate complex policies, or make sequential decisions where each step depends on the last.

The mechanism is behavioral, not architectural. You define the tool, Claude decides when to use it. The model learns to invoke "think" at natural checkpoints: after receiving tool results, before making irreversible changes, when verifying policy compliance.

On τ-Bench, the difference was dramatic. The benchmark simulates realistic customer service scenarios where agents must follow detailed policy guidelines while using multiple tools. The pass^k metric measures consistency—the probability that all k independent trials succeed. Unlike pass@k (which measures if at least one trial succeeds), pass^k evaluates reliability. For customer service, you need consistent policy adherence, not occasional success.
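The two metrics diverge quickly as k grows. Given an estimated per-trial success rate p, pass^k is p^k while pass@k is 1 − (1 − p)^k, so a quick calculation shows why consistency is the harder bar (a sketch of the definitions, not τ-Bench's exact estimator):

```python
def pass_power_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (reliability)."""
    return p ** k

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k

# An agent that succeeds 80% of the time per trial:
p = 0.8
print(round(pass_at_k(p, 5), 4))     # at-least-once: 0.9997
print(round(pass_power_k(p, 5), 4))  # every time:    0.3277
```

An agent that looks near-perfect on pass@5 still fails the all-five-trials bar two times out of three—which is exactly the gap a customer service deployment cares about.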

In the airline domain, Claude 3.7 with the "think" tool and optimized prompting achieved 0.584 on pass^1 versus 0.332 baseline. The gap persisted across trials: 0.340 versus 0.100 at pass^5. The tool didn't just improve average performance—it made Claude more reliable on edge cases.

The retail domain showed similar patterns. The "think" tool alone (no additional prompting) reached 0.812 on pass^1 versus 0.783 baseline. The retail policy is simpler than airline, so Claude benefited from having thinking space without needing explicit guidance on how to use it.

For SWE-Bench, Anthropic adapted the tool description to emphasize brainstorming and bug-fixing strategies. The isolated effect was smaller—1.6% improvement—but statistically significant (p < .001). In code repair tasks, the "think" tool helps Claude explore multiple fix approaches before committing to changes.

What This Changes For Developers

The "think" tool shifts how you architect agentic workflows. Instead of hoping Claude reasons correctly in one shot, you design explicit reasoning checkpoints.

Consider a customer service agent that handles refund requests. Without "think," Claude might check the order status, see it's eligible, and immediately process the refund—missing that the customer has an outstanding balance or that the refund violates a promotional terms clause. With "think," Claude pauses after retrieving order details to verify all constraints before acting.

The pattern extends to code generation. When building AI agents with tool use, you want Claude to explore the codebase, identify the bug source, then explicitly reason about multiple fix approaches before editing files. The "think" tool creates that deliberation space.

Prompting strategy matters significantly. On τ-Bench's airline domain, the "think" tool with optimized prompting scored 0.584 versus 0.404 without prompting. The optimized prompt provided examples of reasoning patterns: listing applicable rules, checking for missing information, verifying policy compliance, iterating over tool results.

The prompt structure looked like this: instructions to use "think" before taking action, followed by concrete examples showing how to break down complex requests. One example demonstrated baggage fee calculation across membership tiers. Another showed payment method verification against policy constraints. These examples taught Claude not just to think, but how to think effectively in that domain.

Placement matters too. Anthropic found that complex "think" tool guidance works better in the system prompt than in the tool description itself. The system prompt provides broader context and helps Claude integrate thinking into its overall behavior pattern.
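One way to follow that guidance is to keep the tool description short and put the reasoning guidance in the system prompt. A hedged sketch of how the request payload might be assembled—the prompt text here is illustrative, and only the tool schema follows the shape shown earlier:

```python
THINK_TOOL = {
    "name": "think",
    "description": "Use this tool to reason through complex decisions before "
                   "acting. It logs your thought process without making changes.",
    "input_schema": {
        "type": "object",
        "properties": {"thought": {"type": "string"}},
        "required": ["thought"],
    },
}

# Detailed guidance lives here, not in the tool description
SYSTEM_PROMPT = """\
Before taking any action after a tool call, use the think tool to:
- List the specific rules that apply to the current request
- Check that all required information has been collected
- Verify that the planned action complies with policy
- Iterate over tool results for correctness
"""

def build_request(user_message: str, other_tools: list) -> dict:
    """Assemble a messages-style request payload: complex 'think' guidance
    goes in the system prompt rather than the tool description."""
    return {
        "system": SYSTEM_PROMPT,
        "tools": [THINK_TOOL, *other_tools],
        "messages": [{"role": "user", "content": user_message}],
    }
```

The split keeps the tool definition stable across domains while the system prompt carries the domain-specific examples.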

The cost tradeoff is real but manageable. Each "think" invocation adds output tokens. On policy-heavy tasks, the improved accuracy outweighs the token cost. On simple tasks, you're paying for reasoning you don't need. The key is selective deployment.
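A back-of-envelope estimate helps decide whether that overhead is acceptable. The figures below—thoughts per task, tokens per thought, and price per million output tokens—are placeholder assumptions for illustration, not measured values:

```python
def added_cost_per_task(thoughts_per_task: int,
                        tokens_per_thought: int,
                        usd_per_mtok_out: float) -> float:
    """Estimated extra output-token spend from 'think' invocations on one task."""
    extra_tokens = thoughts_per_task * tokens_per_thought
    return extra_tokens * usd_per_mtok_out / 1_000_000

# Hypothetical: 4 thoughts of ~150 tokens each at $15 per million output tokens
cost = added_cost_per_task(4, 150, 15.0)
print(f"${cost:.4f} per task")  # $0.0090 per task
```

Even at these made-up rates the per-task cost is fractions of a cent, so on policy-heavy tasks the accuracy gain dominates; the tradeoff only bites at high volume on tasks that didn't need the reasoning.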

Try It Yourself

Start with a scenario where Claude currently struggles with multi-step reasoning. Customer service workflows with detailed policies are ideal test cases. Here's a minimal implementation:

{
  "name": "think",
  "description": "Use this tool to reason through complex decisions before taking action. It logs your thought process without making changes. Use it to verify policy compliance, check for missing information, or plan multi-step actions.",
  "input_schema": {
    "type": "object",
    "properties": {
      "thought": {
        "type": "string",
        "description": "Your reasoning about the current situation and next steps."
      }
    },
    "required": ["thought"]
  }
}

Add domain-specific guidance to your system prompt. For a refund processing agent, you might include:

Before processing any refund, use the think tool to verify:
- Order status and eligibility window
- Outstanding balances or pending charges
- Promotional terms that might restrict refunds
- Required approvals based on refund amount

Example reasoning:
"User requests refund for order #12345. Retrieved order details show purchase date 45 days ago. Standard policy allows refunds within 30 days, but user has Premium membership which extends to 60 days. Order total is $250, no outstanding balance. No promotional restrictions apply. Proceed with refund—no approval needed under $500."
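The reasoning in that example can also be cross-checked deterministically. Here's a toy sketch of the same policy logic—the 30/60-day windows and the $500 approval threshold come from the example above; everything else is hypothetical:

```python
def refund_decision(days_since_purchase: int, membership: str, order_total: float,
                    outstanding_balance: float = 0.0,
                    promo_restricted: bool = False) -> str:
    """Mirror the worked example: the eligibility window depends on membership
    tier, and refunds of $500 or more require approval."""
    window = 60 if membership == "premium" else 30
    if days_since_purchase > window:
        return "deny: outside refund window"
    if outstanding_balance > 0 or promo_restricted:
        return "deny: outstanding balance or promotional restriction"
    if order_total >= 500:
        return "refund: approval required"
    return "refund: no approval needed"

# The order from the example: 45 days old, Premium member, $250, clean account
print(refund_decision(45, "premium", 250.0))  # refund: no approval needed
```

Comparing Claude's logged thoughts against a check like this is one way to catch reasoning drift in production.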

Monitor how Claude uses the tool in practice. You'll see patterns emerge: reasoning after tool calls, verification before irreversible actions, policy checks before user-facing responses. Refine your prompts to encourage effective patterns and discourage verbose thinking that doesn't add value.

For code repair workflows similar to SWE-Bench, adapt the tool description to emphasize exploration:

{
  "name": "think",
  "description": "Use this tool to brainstorm and evaluate approaches before making code changes. After exploring the repository and identifying a bug, use this to consider multiple fix strategies and assess which is simplest and most effective. After receiving test results, use this to reason about why tests failed and plan fixes.",
  "input_schema": {
    "type": "object",
    "properties": {
      "thought": {
        "type": "string",
        "description": "Your analysis and reasoning about the code or test results."
      }
    },
    "required": ["thought"]
  }
}

The Bottom Line

Use the "think" tool if you're building agents that navigate complex policies, make sequential decisions where mistakes compound, or need to carefully analyze tool outputs before acting. The τ-Bench results prove it: a 54% relative improvement on policy-heavy tasks isn't noise.

Skip it if your use case involves simple single-step tool calls or straightforward instruction following. Extended thinking handles those scenarios better without the mid-execution overhead. Also skip it if you're optimizing for minimum token usage—each "think" invocation costs output tokens.

The real opportunity is in agentic workflows where reliability matters more than speed. Customer service agents that must follow regulations. Code repair systems that can't afford to introduce new bugs. Financial applications where policy violations have consequences. These are scenarios where paying for structured reasoning delivers measurable value.

The risk is overuse. Adding "think" to every tool use scenario bloats your token budget without improving outcomes. Be selective. Test on your hardest cases first. If Claude's already handling a task reliably, don't add complexity.

Source: Anthropic