When Defense Systems Become Technical Debt: GitHub's Rate Limit Cleanup
GitHub's emergency rate limits from past incidents were blocking legitimate users. The lesson: defense systems need the same lifecycle management as features — observability, expiration dates, and active maintenance.
TL;DR
- GitHub's emergency rate limits from past incidents were blocking legitimate users during normal browsing
- Only 0.003-0.004% of traffic was affected, but any false positive is unacceptable
- The root cause: incident mitigations added quickly during attacks were never reviewed or removed
- Defense mechanisms need the same lifecycle management as features — observability, expiration dates, and post-incident reviews
The Big Picture
Running a platform at GitHub's scale means building layers of defense. Rate limits, traffic controls, fingerprinting systems. They're essential for keeping the service up during attacks and abuse.
But here's the problem: Emergency protections added during incidents don't come with expiration dates. They get deployed fast, work well enough to stop the immediate threat, and then... they just stay there. Months later, threat patterns evolve. Legitimate tools change. User behavior shifts. And those same protections start blocking real users.
GitHub just cleaned up a set of outdated rate limits that were hitting legitimate users with "too many requests" errors during normal browsing. The impact was small — roughly 3-4 requests per 100,000 — but for the users affected, it was disruptive and confusing. This is a textbook case of technical debt in infrastructure: controls that were correct when deployed but became liabilities without active maintenance.
The lesson here isn't just about rate limits. It's about treating defense systems with the same rigor you apply to features. They need observability. They need lifecycle management. They need someone asking "is this still serving its purpose?"
How Protection Systems Accumulate Debt
GitHub's protection infrastructure is multi-layered. Requests flow through edge tier, application tier, service tier, and backend. Each layer can rate-limit or block based on different signals.
During an incident, you add controls wherever you can deploy them fastest. Maybe that's a fingerprint-based rule at the edge. Maybe it's business logic in the application layer. The goal is stopping the attack, not building the perfect long-term solution.
The rules GitHub removed were composite signals: industry-standard fingerprinting combined with platform-specific business logic. When both conditions matched, requests were blocked 100% of the time. Among requests that matched the suspicious fingerprints, only 0.5-0.9% also triggered the business logic rules and got blocked.
That filtering worked during the original incidents. But over time, some legitimate clients started matching those same patterns. The false positive rate was tiny — 0.003-0.004% of total traffic — but consistent. Users following GitHub links from other apps or just browsing normally were hitting rate limits meant for abusers.
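The composite rule described above can be sketched in a few lines. This is an illustrative model, not GitHub's implementation — the `Request` fields, the fingerprint denylist, and the business-logic condition are all invented for the example:

```python
# Sketch of composite-signal blocking: a request is rejected only when a
# suspicious fingerprint AND a platform-specific business-logic rule both
# match. All names and values here are hypothetical.

from dataclasses import dataclass

@dataclass
class Request:
    fingerprint: str      # e.g. a TLS/HTTP client fingerprint hash
    path: str
    authenticated: bool

# Signal 1: industry-standard fingerprinting (hypothetical denylist)
FINGERPRINT_DENYLIST = {"fp_a1b2", "fp_c3d4"}

def fingerprint_matches(req: Request) -> bool:
    return req.fingerprint in FINGERPRINT_DENYLIST

# Signal 2: business logic (hypothetical: unauthenticated hits to a hot path)
def business_logic_matches(req: Request) -> bool:
    return not req.authenticated and req.path.startswith("/hot/path")

def should_block(req: Request) -> bool:
    # Block only on the intersection of both signals. The intersection is
    # what kept the false positive rate low originally — and also what made
    # the rule fragile as legitimate clients drifted into the same patterns.
    return fingerprint_matches(req) and business_logic_matches(req)
```

The key property is the AND: a fingerprint match alone does nothing, which is why only a fraction of fingerprint-matching requests were ever blocked.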
This is the lifecycle problem: Control added during incident → Works initially → Remains active without review → Eventually blocks legitimate traffic. Without expiration dates or post-incident reviews, every emergency mitigation becomes permanent technical debt.
GitHub's infrastructure team has dealt with similar challenges before. Their rebuild of GitHub Actions showed how systems built for one scale need fundamental rethinking as usage grows. Defense systems have the same problem, but the symptoms are harder to spot.
What This Changes For Developers
If you're running infrastructure at scale, this should sound familiar. You've probably got rate limits and abuse controls scattered across your stack. Some were added last month. Some were added three years ago during an incident nobody remembers.
The tracing problem is real. When a user reports a block, you need to correlate logs across multiple systems with different schemas. GitHub's investigation went: user reports → edge tier logs → application tier logs → protection rule analysis. Each step required different tools and context.
The fix isn't just removing outdated rules. It's building lifecycle management into how you operate defense systems. That means:
Treating incident mitigations as temporary by default. If a rule needs to be permanent, that should require documentation and a conscious decision.
Building visibility across all protection layers so you can trace which layer blocked a request and why. Distributed logs aren't enough if you can't correlate them quickly.
Conducting post-incident reviews that specifically evaluate emergency controls. Did they work? Are they still needed? What's the false positive rate now versus when we deployed them?
This is the same discipline GitHub applied to their availability incidents in November 2025 — treating operational problems as engineering problems that need systematic solutions, not just quick fixes.
The Observability Gap
GitHub's post-mortem highlights a critical gap: observability for defense systems lags behind observability for features. You probably have detailed metrics on API latency, error rates, and user flows. Do you have the same visibility into which protection rules are firing, how often, and against what traffic patterns?
The challenge is that protection systems are designed to be opaque to attackers. You don't want to expose which signals you're using or how rules are structured. But that same opacity makes it hard for your own team to understand what's happening.
GitHub's solution involves better visibility across all protection layers and treating mitigations as temporary by default. The specifics matter less than the principle: defense mechanisms need the same instrumentation and lifecycle management as the systems they protect.
Try It Yourself
If you're managing rate limits or abuse controls, audit them now. For each rule, ask:
- When was this added and why?
- What traffic pattern was it meant to block?
- Is that pattern still a threat today?
- What's the false positive rate?
- Who owns this rule and reviews it?
If you can't answer those questions, you've got technical debt. Start with rules added during incidents more than six months ago. Those are the most likely to have outlived their purpose.
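The checklist above can be mechanized if you record metadata alongside each rule. A minimal sketch, assuming a metadata shape you'd define yourself (the field names and rules here are invented, not any vendor's schema):

```python
# Sketch of a rule-audit pass over the checklist above: flag rules with no
# owner, and rules never reviewed since being added more than six months ago.
# The rule records and field names are assumptions for illustration.

from datetime import date, timedelta

rules = [
    {"name": "fp-composite-7", "added": date(2023, 2, 1),
     "reason": "incident mitigation", "owner": None, "last_review": None},
    {"name": "login-burst", "added": date(2025, 6, 1),
     "reason": "credential stuffing", "owner": "abuse-team",
     "last_review": date(2025, 9, 1)},
]

def audit(rules, today, stale_after=timedelta(days=180)):
    findings = []
    for r in rules:
        if r["owner"] is None:
            findings.append((r["name"], "no owner"))
        if r["last_review"] is None and today - r["added"] > stale_after:
            findings.append((r["name"], "never reviewed, older than 6 months"))
    return findings
```

Even this crude pass surfaces the dangerous cases first: unowned, unreviewed rules deployed during long-forgotten incidents.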
For new mitigations, set expiration dates. Use calendar reminders, tickets, or automated alerts. Make the default behavior "this rule expires in 30 days unless someone explicitly renews it." That forces the review conversation.
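One way to make expiry the default is to enforce mitigations only while unexpired, with renewal as an explicit act. A minimal sketch — the 30-day TTL and record shape are assumptions taken from the suggestion above, not a specific tool's API:

```python
# Sketch of "expires unless renewed": every mitigation is created with an
# expiry date and is only enforced while unexpired. TTL and field names
# are illustrative assumptions.

from datetime import date, timedelta

DEFAULT_TTL = timedelta(days=30)

def new_mitigation(name: str, deployed: date, ttl: timedelta = DEFAULT_TTL) -> dict:
    return {"name": name, "expires": deployed + ttl}

def renew(mitigation: dict, today: date, ttl: timedelta = DEFAULT_TTL) -> dict:
    # Renewal is deliberate — this is where you force the review conversation.
    mitigation["expires"] = today + ttl
    return mitigation

def is_active(mitigation: dict, today: date) -> bool:
    return today < mitigation["expires"]
```

The design choice worth copying is the failure mode: an unattended rule turns itself off, rather than silently blocking legitimate users forever.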
The Bottom Line
Use this approach if you're running infrastructure at scale and dealing with abuse or attacks. The multi-layer defense model is sound, but only if you maintain it actively. Skip it if you're early-stage and don't have abuse problems yet — defense layers built before you need them just become unmaintained rules that block legitimate users later.
The real risk is treating incident mitigations as fire-and-forget. They're not. Every emergency control you deploy without an expiration date or review process becomes technical debt that will eventually block legitimate users. GitHub caught this because users reported it publicly. How many outdated rules are sitting in your stack right now, quietly blocking a small percentage of legitimate traffic that hasn't complained yet?
Defense systems need lifecycle management. Build it in from the start, or you'll be cleaning up technical debt later while apologizing to users.
Source: GitHub Blog