GitHub's Reliability Crisis: What Went Wrong and What's Next

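GitHub admits two major April 2026 incidents exposed fundamental scaling problems as agentic workflows push the platform to design for 30X scale. A merge queue regression produced incorrect merge commits, and an overloaded search cluster disrupted search-backed UI across the platform. Availability now comes before features.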

TL;DR

  • GitHub experienced two major incidents in April 2026: a merge queue regression produced incorrect merge commits in 230 repositories, and an overloaded search subsystem disrupted search-backed UI components platform-wide
  • Agentic development workflows have pushed GitHub to design for 30X scale, well beyond the 10X capacity plan it began executing in October 2025
  • GitHub is prioritizing availability over features, isolating critical services, and moving to multi-cloud infrastructure
  • The platform is redesigning APIs, optimizing merge queue operations, and publishing detailed root cause analyses for both incidents

The Big Picture

GitHub published a detailed update on platform availability following two significant incidents in late April 2026. On April 23, a regression in merge queue operations produced incorrect merge commits in 230 repositories. On April 27, the Elasticsearch subsystem became overloaded, disrupting search-backed UI components across pull requests, issues, and projects.

The incidents occurred during a period of exponential growth that GitHub did not initially anticipate. Since December 2025, agentic development workflows have accelerated sharply. Merged pull requests reached 90 million per month, commits hit 1.4 billion, and new repositories reached 20 million per month. GitHub started executing a 10X capacity plan in October 2025. By February 2026, it had determined that it needed to design for 30X scale.
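
To put the monthly figures in perspective, the short sketch below converts them into average per-second rates. It assumes a 30-day month and evenly spread traffic, which real workloads are not, so treat the results as rough averages rather than peak load. The commit figure is left out because the text does not state its time period. The names `MONTHLY_VOLUMES` and `average_per_second` are illustrative only.

```python
# Rough averages only: assumes a 30-day month and perfectly even traffic.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Monthly figures quoted in the article.
MONTHLY_VOLUMES = {
    "pull requests merged": 90_000_000,
    "new repositories": 20_000_000,
}


def average_per_second(monthly_count: int) -> float:
    """Convert a monthly count into an average per-second rate."""
    return monthly_count / SECONDS_PER_MONTH


for name, count in MONTHLY_VOLUMES.items():
    print(f"{name}: about {average_per_second(count):.1f} per second on average")
```

Under those assumptions, 90 million merged pull requests per month works out to roughly 35 merges every second, around the clock.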

The challenge isn't just traffic volume. A single pull request touches Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound. Queues deepen. Cache misses become database load. Indexes fall behind. Retries amplify traffic. One slow dependency cascades across multiple product experiences.
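
The claim that retries amplify traffic can be made concrete with a little arithmetic. The sketch below is a generic worst-case model, not a description of GitHub's actual retry policy: it assumes every layer in a call chain makes the same fixed number of attempts and that every attempt fails through to the layer below. The function name and the example layer count are hypothetical.

```python
def worst_case_amplification(attempts_per_layer: int, layers: int) -> int:
    """Worst-case requests reaching the bottom dependency for one user request.

    Assumes each layer makes `attempts_per_layer` attempts in total (the first
    try plus its retries) and that every attempt fails, so the attempts
    multiply at each layer of the call chain.
    """
    if attempts_per_layer < 1 or layers < 0:
        raise ValueError("attempts_per_layer must be >= 1 and layers >= 0")
    return attempts_per_layer ** layers


# Hypothetical chain: UI -> API -> service -> database, 3 attempts at each hop.
print(worst_case_amplification(attempts_per_layer=3, layers=3))
```

With three attempts at each of three hops, one user request can turn into 27 requests against the slowest dependency, which is why retry budgets and backoff matter most when a subsystem is already struggling.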

GitHub's response focuses on three priorities: availability first, then capacity, then new features. The team is reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and moving performance-sensitive paths into systems designed for these workloads.

How It Works

GitHub's architecture is a distributed system where multiple components interact for every operation. The Ruby monolith handles most product logic. MySQL backs critical paths. Elasticsearch powers search. Redis manages caching. Git storage sits behind custom infrastructure.

The April 23 merge queue incident was a logic regression. Pull requests merged through merge queue using the squash merge method produced incorrect merge commits when a merge group contained more than one pull request. Changes from previously merged pull requests and prior commits were inadvertently reverted by subsequent merges. During the impact window, 2,092 pull requests across 230 repositories were affected. The issue did not affect pull requests merged outside merge queue, nor did it affect merge queue groups using merge or rebase methods. There was no data loss — all commits remained stored in Git — but the state of affected default branches was incorrect.
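
The following toy model illustrates one way a squash merge in a multi-PR group could silently revert an earlier change. It is a hypothetical illustration, not GitHub's published root cause: it assumes the faulty path builds the squash commit from the pull request's own snapshot instead of applying the pull request's changes on top of the current branch tip. Trees are modeled as plain dictionaries, and all names are invented for the example.

```python
Tree = dict[str, str]  # file path -> file content


def changes(base: Tree, head: Tree) -> Tree:
    """Files added or modified between base and head (deletions omitted for brevity)."""
    return {path: content for path, content in head.items() if base.get(path) != content}


def squash_onto_tip(tip: Tree, base: Tree, head: Tree) -> Tree:
    """Expected behavior: apply only the PR's own changes on top of the current tip."""
    return {**tip, **changes(base, head)}


def squash_from_snapshot(tip: Tree, base: Tree, head: Tree) -> Tree:
    """Hypothetical faulty behavior: use the PR's snapshot and ignore the tip."""
    return dict(head)


# Two PRs branch from the same base and land in one merge group.
base = {"a.txt": "v1", "b.txt": "v1"}
pr1_head = {"a.txt": "v2", "b.txt": "v1"}  # PR 1 edits a.txt
pr2_head = {"a.txt": "v1", "b.txt": "v2"}  # PR 2 edits b.txt

tip_after_pr1 = squash_onto_tip(base, base, pr1_head)
good = squash_onto_tip(tip_after_pr1, base, pr2_head)
bad = squash_from_snapshot(tip_after_pr1, base, pr2_head)

print("expected:", good)
print("faulty:  ", bad)
```

In the faulty result, `a.txt` is back at its original content even though PR 1's commit still exists in history. That matches the shape of the reported symptom: nothing is lost from Git, but the branch state is wrong.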

The April 27 search incident was a capacity failure. The Elasticsearch cluster became overloaded and stopped returning search results. GitHub's preliminary analysis indicates a botnet attack as the likely cause. Git operations and APIs were not impacted, but UI components dependent on search showed no results. Pull requests, issues, and projects were disrupted. This was a known single point of failure GitHub had not yet isolated because other areas ranked higher in their risk-prioritized reliability work.

GitHub's remediation plan addresses distributed systems fundamentals: reducing hidden coupling, limiting blast radius, and making the platform degrade gracefully when one subsystem is under pressure. Short-term work included resolving bottlenecks by moving webhooks out of MySQL, redesigning the user session cache, and reworking authentication and authorization flows to reduce database load. The team also used its Azure migration to stand up additional compute capacity.
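
Graceful degradation is often implemented with a circuit breaker around a fragile dependency. The sketch below is a minimal, generic version of that pattern, not GitHub's implementation: after a configurable number of consecutive failures it stops calling the dependency for a cooldown period and returns a fallback instead. The class name, thresholds, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Skip a failing dependency for a cooldown period and serve a fallback."""

    def __init__(
        self,
        max_failures: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback  # open: do not add load to the struggling dependency
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or reopen) the breaker
            return fallback
        self.failures = 0  # a success closes the breaker fully
        return result
```

A search-backed UI component could wrap its query in `breaker.call(run_search, fallback=[])` and render a "search temporarily unavailable" state, so the rest of the page keeps working while the search cluster recovers.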

Next, GitHub focused on isolating critical services like Git and GitHub Actions from other workloads and minimizing single points of failure. This work started with careful analysis of dependencies and different tiers of traffic to understand what needs to be pulled apart and how to minimize impact on legitimate traffic from various attacks. The team is also accelerating migration of performance-sensitive code from Ruby to Go.

Longer-term measures include moving to multi-cloud infrastructure for resilience, low latency, and flexibility. GitHub is also investing heavily in Git system optimizations and pull request experience improvements to handle large monorepos — a harder scaling challenge than repository count alone.

What This Changes For Developers

GitHub published updated availability numbers on the status page and committed to posting status updates for incidents both large and small. The team is improving how it categorizes incidents so that scale and scope are easier to understand, and it is working on better ways for customers to report incidents and share signals during disruptions.

For the April 23 incident, GitHub published a detailed root cause analysis with detection steps. Teams using merge queues with squash merge can review recent merges for unexpected reverts or missing commits. GitHub is handling affected repositories case-by-case through support tickets.

The shift in priorities — availability first, then capacity, then new features — means slower feature velocity. GitHub is focused on reliability work: reducing unnecessary load, improving caching, removing single points of failure. The team will publish a separate blog post soon describing extensive work on Git system optimizations and a new API design for greater efficiency and scale, including optimizations for merge queue operations in repositories with thousands of daily pull requests.

The platform's move to usage-based billing for GitHub Copilot and other recent pricing changes reflect preparation for a future where AI agents generate far more activity than human developers. Teams using agentic workflows should expect capacity controls to evolve as GitHub tunes for this new usage pattern.

Try It Yourself

GitHub's status page now includes availability numbers and detailed incident reports. Subscribe to status updates to receive notifications during disruptions.

If your team uses merge queues, review the root cause analysis for the April 23 incident. The post includes specific detection steps for identifying affected repositories. Teams that find incorrect merge commits can open a support ticket for case-by-case remediation.

For teams running large monorepos or experiencing performance issues with pull requests, watch for GitHub's upcoming blog post on API redesign and merge queue optimizations. The post will detail technical improvements specifically for high-volume repositories.

The Bottom Line

Use GitHub's status page, which now includes availability numbers and incident reports, if you need real-time visibility into platform health. Skip self-hosted alternatives unless you have dedicated infrastructure engineers: GitHub's scale problems are complex, and most teams won't solve them better in-house. The real opportunity is in GitHub's transparency: detailed root cause analyses, public availability metrics, and clear communication about priorities. Teams running agentic workflows should monitor the upcoming API redesign post, since those optimizations will directly affect high-volume repositories. The risk is assuming quick fixes: distributed systems work takes months, and GitHub's infrastructure improvements are 2026-2027 projects, not April hotfixes.

Source: GitHub Blog
