GitHub's Outage Postmortem: What Broke and What They're Fixing
GitHub published a detailed postmortem of three major outages in early 2025. Cache TTL changes, 10x client traffic spikes, and failover bugs exposed architectural limits. Here's what broke and what they're fixing.
TL;DR
- GitHub suffered three major outages in February-March 2025, caused by rapid load growth exposing architectural scaling limits
- The February 9 incident stemmed from a cache TTL change plus a 10x traffic spike from popular client apps overwhelming a core auth database
- Actions outages on February 2 and March 5 revealed single points of failure in failover systems
- GitHub is redesigning user cache systems, isolating critical paths, and migrating 50% of traffic to Azure by July 2025
The Big Picture
GitHub just published what every platform engineer dreads writing: a public postmortem admitting they failed their own availability standards. Over the span of roughly a month in early 2025, three major incidents took down or degraded services millions of developers depend on daily. Authentication failed. Actions pipelines stalled. Git operations crawled.
This isn't a story about a single bad deploy or a rogue configuration change. It's about what happens when explosive growth meets architectural debt. GitHub's traffic is scaling faster than their infrastructure can handle, and the cracks are showing in places that were designed for a different era of the platform.
The transparency here is notable. GitHub didn't just acknowledge the outages — they published technical details about cache TTLs, database cluster overload, and failover configuration bugs. For platform teams dealing with similar scaling challenges, this postmortem is a masterclass in what goes wrong when "simple" architectural decisions from years ago collide with modern load patterns.
How It Works (Or Didn't)
The February 9 incident is the most instructive. It started weeks earlier when two popular client applications shipped updates that unintentionally generated 10x more API traffic. Because users upgrade gradually, the load increase was invisible at first — no sudden spike, just a slow burn.
Then GitHub's team made a seemingly minor change: they reduced a cache TTL for user settings from 12 hours to 2 hours. The reason was legitimate — they needed to roll out a new model to customers faster, and the longer TTL was blocking that. But the change meant the database cluster handling authentication and user management suddenly had to serve 6x more write traffic.
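The arithmetic behind that 6x figure is worth making explicit: in steady state, each cached entry is refreshed from the backing database roughly once per TTL window, so refresh traffic scales inversely with the TTL. A back-of-the-envelope sketch (the one-million-entry figure is illustrative, not GitHub's actual number):

```python
def refresh_rate(active_entries: int, ttl_hours: float) -> float:
    """Steady-state refreshes per hour: each cached entry expires and is
    reloaded from the database roughly once per TTL window."""
    return active_entries / ttl_hours

# Hypothetical 1M active user-settings entries
baseline = refresh_rate(1_000_000, ttl_hours=12)  # ~83K refreshes/hour
after    = refresh_rate(1_000_000, ttl_hours=2)   # ~500K refreshes/hour

print(after / baseline)  # 6.0 — the 6x increase GitHub reported
```

The ratio is just 12/2, which is why a "minor" TTL change deserves the same load review as a capacity change.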
Everything looked fine over the weekend. Low traffic, no alarms. Then Monday hit. Peak load returned. Users updated their client apps. Another model release went out. The database cluster collapsed under combined read and write pressure.
The real problem wasn't the TTL change or the client apps individually. It was architectural coupling. User settings, model policies, and authentication data all lived in the same database cluster. That cluster was chosen years ago for simplicity when model data was "a few bytes per user." It grew to kilobytes. Nobody caught it because the load only spiked during model rollouts, and the 12-hour TTL masked the danger.
When the cluster failed, it cascaded. Authentication depends on it. User management depends on it. Services that depend on those services failed too. GitHub didn't have granular enough controls to block just the problematic traffic upstream. They had to shut off broader swaths of functionality to stop the bleeding.
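One common way to get the per-client granularity GitHub lacked is a token-bucket limiter keyed by client identity (for example, a User-Agent or app ID), so a single misbehaving app can be throttled without shutting off functionality for everyone. This is a generic sketch of the technique, not GitHub's implementation:

```python
import time
from collections import defaultdict

class PerClientLimiter:
    """Token-bucket rate limiter keyed by client identity, so one noisy
    client app can be throttled without blocking all traffic."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec          # tokens refilled per second
        self.burst = burst                # maximum bucket size
        self.tokens = defaultdict(lambda: burst)   # each client starts full
        self.last = defaultdict(time.monotonic)    # last-seen timestamp

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_id]
        self.last[client_id] = now
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens[client_id] = min(
            self.burst, self.tokens[client_id] + elapsed * self.rate
        )
        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False
```

Deployed at the edge, a keyed limiter like this lets operators shed only the traffic pattern causing the overload, rather than turning off broad swaths of the product.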
The Actions incidents on February 2 and March 5 exposed different failure modes. On February 2, a telemetry gap triggered security policies that blocked access to VM metadata across all regions. Normally, Actions fails over to healthy regions automatically. This time, the failure was global. On March 5, a Redis cluster failover worked as designed, but a latent configuration bug left the cluster with no writable primary. Automated failover couldn't fix it. Manual intervention took time.
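One guard against the March 5 failure mode is to verify the topology after every failover rather than trusting that it "worked as designed." A hypothetical post-failover check — the node dicts here are stand-ins for what a real probe of each node's role (e.g. Redis `ROLE`/`INFO`) would return:

```python
def verify_writable_primary(nodes: dict) -> str:
    """Post-failover sanity check (hypothetical sketch): confirm exactly
    one node reports itself as a writable primary.

    `nodes` maps node name -> {"role": ..., "read_only": ...}, standing in
    for the output of probing each cluster member."""
    primaries = [
        name for name, info in nodes.items()
        if info["role"] == "master" and not info["read_only"]
    ]
    if len(primaries) != 1:
        # This is exactly the state that required manual intervention:
        # failover "succeeded" but left nothing accepting writes.
        raise RuntimeError(
            f"expected 1 writable primary, found {len(primaries)}: {primaries}"
        )
    return primaries[0]
```

Running a check like this (plus a canary write) as the final step of automated failover turns a silent read-only cluster into an immediate, actionable alert.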
Both incidents revealed single points of failure that shouldn't exist. Failover procedures that worked in testing failed in production. The gap between "we have redundancy" and "redundancy actually works under load" turned out to be wide.
What This Changes For Developers
If you're running CI/CD on GitHub Actions, the immediate takeaway is that GitHub's reliability isn't what it used to be. The platform is in the middle of a major infrastructure transition, and that means more risk of disruption over the next several months.
GitHub's response plan has two tracks. Near-term stabilization includes redesigning the user cache system to handle higher volume in a segmented database cluster, auditing critical infrastructure capacity, isolating key dependencies so Actions and Git operations can't be taken down by shared infrastructure failures, and adding better load shedding to prevent cascading failures.
The longer-term play is more ambitious. GitHub is migrating to Azure to enable both vertical scaling within regions and horizontal scaling across regions. As of the postmortem publication, 12.5% of GitHub traffic runs on Azure Central US. They're targeting 50% by July 2025. That's aggressive. It also means the next few months will be high-risk as they balance traffic between legacy infrastructure and new Azure regions.
They're also breaking apart the monolith. More isolated services, more isolated data domains, independent scaling, localized traffic shedding. This is the right architectural direction, but monolith decomposition is notoriously hard to do without introducing new failure modes. Expect more incidents as they work through this transition.
For teams that depend on GitHub, the practical advice is to build in more resilience. If your deployment pipeline assumes GitHub is always available, you're going to have a bad time over the next six months. Consider fallback strategies for critical workflows. Cache dependencies aggressively. Don't assume Actions will complete on schedule.
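On the "don't assume Actions will complete on schedule" point, even a basic retry wrapper around flaky network steps (a git fetch, an artifact download) buys real resilience during partial outages. A minimal sketch, not a complete fallback strategy:

```python
import random
import time

def run_with_backoff(op, attempts: int = 5, base: float = 1.0):
    """Retry a flaky operation with exponential backoff plus jitter,
    instead of failing the whole pipeline on the first transient error."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise  # out of retries: surface the real error
            # Delay doubles each attempt; jitter avoids a thundering herd
            # of clients all retrying in lockstep against a degraded service.
            time.sleep(min(base * (2 ** i) + random.uniform(0, base), 30))
```

The jitter matters: the February 9 incident was partly a story of many clients converging on a struggling backend at once, and synchronized retries make that worse.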
Resources
GitHub publishes incident summaries on their status page and detailed monthly availability reports. The February 2025 report breaks down all six incidents from that month with technical details.
If you're building platform infrastructure and want to avoid similar scaling traps, focus on these areas:
- Cache TTL changes — Always model the downstream load impact before reducing TTLs, especially for high-cardinality data like user settings or policies
- Client app instrumentation — If you control client applications that hit your APIs, instrument them to detect traffic pattern changes before they hit production at scale
- Architectural coupling audits — Map which services share critical infrastructure. If authentication and feature data live in the same database, you have a ticking time bomb
- Failover dry runs — Test failover procedures in production, not just staging. Latent configuration issues only surface under real load
The Bottom Line
GitHub is in the middle of a painful but necessary infrastructure overhaul. The outages in February and March weren't flukes — they're symptoms of a platform that outgrew its architecture. The transparency is commendable, but transparency doesn't prevent the next incident.
If you're heavily invested in GitHub for CI/CD, source control, or package hosting, plan for more instability through mid-2025. The Azure migration and monolith decomposition are the right moves long-term, but they introduce significant risk short-term. Teams running critical infrastructure on GitHub should have contingency plans.
If you're a platform engineer, this postmortem is required reading. The failure modes here — cache TTL changes cascading into database overload, client app traffic spikes going undetected, failover procedures that don't work in production — are universal. GitHub's scale is extreme, but the architectural mistakes are common. The real lesson isn't that GitHub screwed up. It's that scaling is hard, architectural debt compounds, and the gap between "it works in testing" and "it works under production load" can sink your platform.
Source: GitHub Blog