Inside Anthropic's Three-Week Infrastructure Nightmare
Three infrastructure bugs hit Claude in overlapping windows between August and September 2025, affecting up to 16% of requests at peak. Anthropic's detailed postmortem reveals why detection took weeks and what they're changing.
TL;DR
- Three overlapping infrastructure bugs degraded Claude responses between August and September 2025
- A context window routing error, an output corruption bug, and an XLA:TPU compiler miscompilation overlapped in production
- The bugs affected up to 16% of requests at peak, with "sticky" routing making some users experience consistent degradation
- Anthropic never throttles model quality due to demand — these were pure infrastructure failures
- Detection took weeks because the bugs produced inconsistent symptoms across platforms and their evals didn't catch the issues
The Big Picture
In late August, Claude started giving weird answers. Not consistently. Not for everyone. Just often enough that developers noticed something was off.
Some users saw Thai characters appear mid-response to English prompts. Others got syntax errors in generated code that Claude would normally catch. The reports were confusing and contradictory — some users swore quality had tanked while others saw no issues at all.
Anthropic just published a detailed postmortem explaining what happened. Three separate infrastructure bugs hit their production systems within three weeks of each other. The overlapping nature of these failures made diagnosis brutally difficult. By the time they figured it out, one bug was affecting 16% of Claude Sonnet 4 requests at its worst hour.
This is the kind of postmortem most companies would never publish. Anthropic is sharing technical details about their multi-platform serving infrastructure, compiler bugs in Google's XLA:TPU stack, and why their evaluation systems failed to catch any of this before users did. It's a rare look at what happens when you're serving millions of requests across AWS Trainium, NVIDIA GPUs, and Google TPUs — and three things break at once.
How It Went Wrong
The first bug landed on August 5. A routing error started sending some Claude Sonnet 4 requests to servers configured for the upcoming 1M token context window. Initially it affected only 0.8% of requests. Annoying, but not catastrophic.
Then on August 25 and 26, two more bugs shipped. One was a misconfiguration on TPU servers that corrupted token generation. The other exposed a latent compiler bug in XLA:TPU's approximate top-k operation.
The real disaster came on August 29. A routine load balancing change suddenly increased the percentage of requests hitting the misconfigured servers. That first routing bug went from affecting 0.8% of requests to 16% at its worst hour on August 31.
Here's where it gets nasty: Anthropic's routing is "sticky." Once your request hit a broken server, your subsequent requests would likely hit the same broken server. Some users experienced consistent degradation while others never saw a problem. This created a debugging nightmare — contradictory reports that didn't point to any single cause.
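To see why sticky routing turns a partial outage into consistent per-user degradation, here is a minimal sketch of hash-based session affinity. This is a generic illustration, not Anthropic's actual routing implementation; the server names and `route` function are hypothetical.

```python
import hashlib

# Hypothetical server pool; imagine "server-b" is one of the
# misconfigured 1M-context servers.
SERVERS = ["server-a", "server-b", "server-c", "server-d"]

def route(session_id: str, servers=SERVERS) -> str:
    # Hash a stable session key so every request from the same
    # session lands on the same server ("sticky" routing).
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# The same session always maps to the same server, so a user pinned
# to a broken server sees degradation on every request, while a user
# pinned to a healthy server never sees the problem at all.
assert route("user-123") == route("user-123")
```

This is exactly why the bug reports were contradictory: the failure wasn't random noise spread evenly across traffic, but a fixed subset of sessions pinned to a fixed subset of broken servers.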
The output corruption bug was producing tokens that should rarely appear given the context. A runtime performance optimization was occasionally assigning high probability to completely wrong tokens. Ask a question in English, get "สวัสดี" in the middle of the response. Request code, get obvious syntax errors.
The XLA:TPU compiler bug was the most technically complex. It involved mixed-precision arithmetic in how Claude calculates probabilities for the next token. The models compute in bf16 (a 16-bit brain floating-point format), but the TPU's vector processor is fp32-native, so the XLA compiler optimizes by converting some operations to fp32 at runtime.
This created a precision mismatch. Operations that should have agreed on the highest probability token were running at different precision levels. Sometimes the highest probability token would just disappear from consideration entirely.
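The effect of a precision mismatch on token selection can be demonstrated in a few lines of NumPy. This is an illustration of the general failure mode, not Anthropic's code: it uses float16 as a stand-in for bf16 (NumPy has no native bf16 type), and the logit values are contrived to sit closer together than float16 can resolve.

```python
import numpy as np

# Two candidate tokens whose logits differ by less than float16's
# resolution near 10.0 (spacing ~0.0078), but well within float32's.
logits_fp32 = np.array([10.0001, 10.0002], dtype=np.float32)

# At full precision, token 1 is the clear winner.
best_fp32 = int(np.argmax(logits_fp32))

# At reduced precision both logits round to 10.0, the tie breaks to
# the first index, and a different "highest probability" token wins.
best_fp16 = int(np.argmax(logits_fp32.astype(np.float16)))

print(best_fp32, best_fp16)  # 1 0
```

When different stages of the sampling pipeline run at different precisions, they can disagree like this about which token is on top, and the true highest-probability token can effectively vanish from consideration.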
Anthropic had actually discovered a version of this bug back in December 2024 when temperature was set to zero. They deployed a workaround. But when they rewrote their sampling code on August 26 to fix the underlying precision issues, they removed the December workaround. That exposed a deeper bug in the approximate top-k operation — a performance optimization that quickly finds the highest probability tokens.
The approximate top-k bug only manifested for certain batch sizes and model configurations. The same prompt might work perfectly on one request and fail on the next. It changed behavior depending on what operations ran before or after it, and whether debugging tools were enabled.
While investigating, Anthropic discovered that exact top-k no longer had the performance penalty it once did. They switched from approximate to exact top-k and accepted the minor efficiency hit. Model quality is non-negotiable.
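For contrast, exact top-k is straightforward to express. The sketch below uses NumPy's `argpartition`; it is a generic reference implementation, not Anthropic's TPU kernel, but it shows the property that matters: exact selection always returns the true k highest-scoring tokens, with no dependence on batch size or surrounding operations.

```python
import numpy as np

def exact_top_k(logits: np.ndarray, k: int):
    # argpartition finds the k largest entries in O(n) average time,
    # in no particular order...
    idx = np.argpartition(logits, -k)[-k:]
    # ...then we sort just those k entries, descending by logit.
    idx = idx[np.argsort(logits[idx])[::-1]]
    return idx, logits[idx]

logits = np.array([0.1, 2.5, -1.0, 3.7, 0.9])
idx, vals = exact_top_k(logits, 2)
print(idx.tolist())  # [3, 1]
```

Approximate top-k algorithms trade this guarantee for speed, which is fine until a compiler bug makes the approximation drop the single token you actually wanted.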
What This Changes For Developers
The postmortem confirms something developers suspected but couldn't prove: when Claude's quality drops, it's infrastructure bugs, not demand-based throttling. Anthropic states this explicitly — they never reduce model quality due to demand, time of day, or server load.
That matters because it changes how you should think about reliability when building on Claude. The risk isn't that Anthropic will quietly degrade your responses during peak hours to save costs. The risk is that infrastructure bugs can slip through their evaluation systems and affect production for weeks.
The "sticky" routing behavior is particularly important for developers to understand. If you're building applications that make multiple sequential requests to Claude, a single routing error could degrade an entire user session. This isn't random noise — it's systematic degradation for a subset of users.
Anthropic's detection challenges also reveal something about the limits of automated evaluation. Their benchmarks didn't catch these issues because Claude often recovers well from isolated mistakes. A single corrupted token or dropped high-probability word might not tank a benchmark score, but it degrades the user experience in ways that are immediately obvious to humans.
The privacy controls that prevented engineers from examining problematic interactions are a feature, not a bug. But they highlight why user feedback is critical. The thumbs-down button in Claude apps and the /bug command in Claude Code aren't just nice-to-haves — they're essential signal for catching issues that evals miss.
For developers building evaluation systems for their own AI applications, this postmortem is a case study in what can go wrong. Noisy evaluations that can't reliably differentiate between working and broken implementations will fail you when you need them most. Anthropic is now developing more sensitive evals and running them continuously on production systems, not just in staging.
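A back-of-envelope calculation shows why coarse pass/fail evals struggle with subtle degradation. Modeling an eval as a binomial pass rate and using the standard two-proportion sample-size approximation (alpha = 0.05, power = 0.8), detecting a small quality drop requires orders of magnitude more samples than detecting a large one. The numbers below are illustrative assumptions, not figures from the postmortem.

```python
import math

def samples_needed(p_base: float, p_degraded: float,
                   z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    # Classic two-proportion sample-size formula per group.
    p_bar = (p_base + p_degraded) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p_base * (1 - p_base)
                                + p_degraded * (1 - p_degraded))) ** 2
    return math.ceil(num / (p_base - p_degraded) ** 2)

# A 2-point drop (90% -> 88% pass rate) takes thousands of samples
# to detect reliably...
print(samples_needed(0.90, 0.88))
# ...while a 20-point collapse is visible with well under a hundred.
print(samples_needed(0.90, 0.70))
```

If a bug only corrupts the occasional token, the per-prompt pass rate barely moves, and an eval run of a few hundred samples simply lacks the statistical power to see it.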
What Anthropic Is Changing
Anthropic is overhauling their evaluation and monitoring systems in three key areas:
More sensitive evaluations: They're developing evals that can reliably differentiate between working and broken implementations. The existing benchmarks were too coarse-grained to catch subtle degradation like occasional token corruption or precision mismatches.
Quality evaluations in production: They were running regular evaluations, but not continuously on true production systems. The context window routing error wouldn't have persisted for weeks if they'd been running quality checks on actual production traffic patterns.
Faster debugging tooling: They're building infrastructure to debug community-sourced feedback without sacrificing user privacy. The current privacy controls are staying in place, but they need better tools to investigate reported issues without direct access to user interactions.
The postmortem also emphasizes the value of user feedback. Developers and researchers often create evaluation methods that complement Anthropic's internal testing. If you've built interesting ways to evaluate model quality, they want to hear about it at feedback@anthropic.com.
The Bottom Line
Use Claude if you need state-of-the-art reasoning and can tolerate occasional infrastructure issues that might take weeks to diagnose. Skip it if you need guaranteed consistency across every single request with zero tolerance for degradation.
The real risk here isn't that Anthropic is cutting corners — this postmortem proves they're not throttling quality to manage costs. The risk is that serving AI models at scale across multiple hardware platforms is genuinely hard, and even sophisticated evaluation systems can miss bugs that users spot immediately.
The real opportunity is that Anthropic is being transparent about their failures and the technical details of what went wrong. Most companies would bury this. They're publishing compiler-level debugging details and admitting their evals failed. That transparency is valuable for anyone building production AI systems.
If you're building on Claude, implement your own quality monitoring. Don't assume Anthropic's evals will catch everything. Use the feedback mechanisms they provide — the /bug command and thumbs-down button are direct lines to their engineering team. And if you're experiencing consistent degradation, it might not be your imagination. It might be sticky routing hitting a broken server pool.
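One cheap monitoring check follows directly from the symptoms users reported: flag responses containing scripts your application never expects, like Thai characters in an English answer. This is a minimal sketch of one such check, not a complete quality-monitoring system; the expected-script set is an assumption you'd tune per application.

```python
import unicodedata

# Scripts your application expects to see; adjust for your use case.
EXPECTED_SCRIPTS = {"LATIN", "COMMON"}

def unexpected_scripts(text: str) -> set:
    """Return Unicode script prefixes present in `text` that fall
    outside EXPECTED_SCRIPTS (ignoring whitespace, digits, symbols)."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names begin with the script, e.g.
        # "THAI CHARACTER SO SUA" or "LATIN SMALL LETTER A".
        name = unicodedata.name(ch, "")
        script = name.split()[0] if name else "UNKNOWN"
        if script not in EXPECTED_SCRIPTS:
            scripts.add(script)
    return scripts

print(unexpected_scripts("The answer is สวัสดี obviously"))  # {'THAI'}
```

Run a check like this (plus a syntax-validity check on generated code) over a sample of production responses, and you would have caught this class of corruption in hours rather than weeks.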
Source: Anthropic