GitHub Enterprise Server Search: Rebuilt for High Availability
GitHub rebuilt Enterprise Server's search architecture using Elasticsearch CCR to eliminate HA deployment deadlocks. The new single-node cluster approach prevents primary shards from landing on read-only replicas—available now in GHES 3.19.1.
TL;DR
- GitHub rebuilt Enterprise Server's search architecture using Elasticsearch's Cross Cluster Replication (CCR) to eliminate years of HA deployment headaches
- Old architecture created deadlock scenarios where replica nodes could trap primary shards, forcing manual intervention during maintenance
- New single-node cluster approach with CCR replication respects leader/follower patterns and prevents data from landing on read-only nodes
- Available now in GHES 3.19.1 as opt-in, becoming default over next two years
The Big Picture
Search isn't just the search bar on GitHub. It powers issue filtering, release pages, project views, PR counts, and dozens of other surfaces developers touch daily. When search breaks in GitHub Enterprise Server, entire workflows grind to a halt.
For years, Enterprise Server administrators lived with a fragile search architecture. Follow the upgrade steps out of order? Corrupted indexes. Take a replica down for maintenance at the wrong moment? Deadlock. The root cause was Elasticsearch's clustering model clashing with Enterprise Server's leader/follower replication pattern.
GitHub's engineering team spent years trying to patch this fundamental mismatch. They built health checks, drift correction systems, even attempted a full "search mirroring" rewrite. Nothing stuck. Database replication is hard, and bolting consistency guarantees onto an incompatible architecture is harder.
Now they've rebuilt it from scratch using Elasticsearch's Cross Cluster Replication. The new architecture eliminates the deadlock scenarios, removes the manual repair workflows, and finally makes search in HA deployments boring—which is exactly what infrastructure should be.
How It Works
The old architecture treated primary and replica nodes as a single Elasticsearch cluster. This gave performance wins—each node could handle search requests locally—but created operational nightmares.
Elasticsearch manages data in shards. Primary shards handle writes and validation. Replica shards are read-only copies. In a multi-node cluster, Elasticsearch can move primary shards between nodes for load balancing. That's fine when all nodes are equal.
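You can see this shard layout directly in stock Elasticsearch. The `_cat/shards` API lists every shard with its role and the node it currently lives on; the index name below is hypothetical, and the command assumes a cluster reachable on localhost:

```shell
# List shard allocation for a hypothetical "issues" index.
# The "prirep" column shows p (primary) or r (replica); "node" shows
# where Elasticsearch has currently placed each copy -- in a shared
# multi-node cluster, a primary shard can land on any node.
curl -s "http://localhost:9200/_cat/shards/issues?v&h=index,shard,prirep,state,node"
```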
But Enterprise Server nodes aren't equal. The primary node handles all writes and user traffic. Replica nodes are read-only standby systems designed for failover. When Elasticsearch moved a primary shard to a replica node and that replica later went down for maintenance, the system locked up: the replica wouldn't start until Elasticsearch was healthy, and Elasticsearch couldn't become healthy until the replica rejoined. Deadlock.
The new architecture flips the model. Each Enterprise Server node now runs its own single-node Elasticsearch cluster. No shared cluster means no shard migration between nodes. Cross Cluster Replication handles data sync between these independent clusters.
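GitHub hasn't published the exact settings it uses, but in stock Elasticsearch, two independent clusters are linked for CCR by registering the leader as a remote cluster on the follower. A minimal sketch, assuming the primary node's transport port is reachable at the hypothetical address `primary-node:9300`:

```shell
# On the replica's single-node cluster: register the primary's
# cluster as a remote named "leader" (hypothetical host and port).
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "leader": {
          "seeds": ["primary-node:9300"]
        }
      }
    }
  }
}'
```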
CCR replicates data after it's been persisted to Lucene segments—Elasticsearch's underlying storage layer. This ensures only durably written data gets replicated. The replication respects the leader/follower pattern: primary node writes, replica node follows. No more primary shards landing on read-only nodes.
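In Elasticsearch terms, that leader/follower relationship is established per index. A hedged sketch using the stock CCR follow API (the index name and remote alias are illustrative, not GitHub's actual configuration):

```shell
# On the replica's cluster: create a read-only follower index that
# replicates the "issues" index from the remote cluster "leader".
curl -s -X PUT "http://localhost:9200/issues/_ccr/follow?wait_for_active_shards=1" \
  -H 'Content-Type: application/json' -d '
{
  "remote_cluster": "leader",
  "leader_index": "issues"
}'
```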
The engineering challenge wasn't just enabling CCR. Elasticsearch's auto-follow API only applies to indexes created after the policy exists. Enterprise Server installations have long-lived indexes that need migration. GitHub built custom bootstrap workflows to attach followers to existing indexes, then enable auto-follow for future indexes.
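The two halves of that bootstrap map onto stock Elasticsearch APIs: an explicit `_ccr/follow` call for each pre-existing index, plus an auto-follow pattern for anything created afterward. A sketch of the latter, with illustrative index patterns:

```shell
# Auto-follow any future index on the leader whose name matches a
# pattern. Existing indexes are NOT picked up by this policy, which
# is why a separate bootstrap workflow is needed for them.
curl -s -X PUT "http://localhost:9200/_ccr/auto_follow/ghes-indexes" \
  -H 'Content-Type: application/json' -d '
{
  "remote_cluster": "leader",
  "leader_index_patterns": ["issues-*", "pull-requests-*"],
  "follow_index_pattern": "{{leader_index}}"
}'
```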
They also engineered workflows for failover, index deletion, and upgrades. Elasticsearch handles document replication. GitHub's code handles the rest of the index lifecycle—ensuring replicas stay in sync during topology changes, cleaning up follower indexes when leaders delete them, and managing the migration from old to new architecture during upgrades.
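Elasticsearch doesn't delete a follower automatically when its leader disappears; tearing one down is a multi-step sequence, which hints at why GitHub wrapped the index lifecycle in its own workflows. A sketch of the stock teardown, with an illustrative index name:

```shell
# Tear down a follower index after its leader was deleted:
# stop replication, close the index, convert it back to a
# regular index, then delete it.
curl -s -X POST "http://localhost:9200/issues/_ccr/pause_follow"
curl -s -X POST "http://localhost:9200/issues/_close"
curl -s -X POST "http://localhost:9200/issues/_ccr/unfollow"
curl -s -X DELETE "http://localhost:9200/issues"
```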
The migration process consolidates all data onto primary nodes, breaks the old cluster topology, and restarts replication using CCR. For large Enterprise Server instances with terabytes of indexed data, this can take hours. But it's a one-time cost that eliminates years of operational friction.
What This Changes For Developers
If you're running GitHub Enterprise Server in HA mode, this changes your maintenance windows. No more carefully orchestrated upgrade sequences to avoid corrupting search indexes. No more manual repair workflows when a replica comes back online with stale data.
The new architecture makes failover cleaner. When a primary node fails and a replica takes over, search keeps working because each node has a complete, independent Elasticsearch cluster. The old architecture required careful coordination to ensure the cluster reformed correctly after failover.
For platform teams managing Enterprise Server, this reduces the operational surface area. Search becomes infrastructure you don't think about. That's the goal. GitHub's own reliability engineering team has been running this architecture internally, and the reduction in search-related incidents is significant enough that they're making it the default over the next two years.
The opt-in period gives administrators time to test the migration in staging environments and provide feedback. GitHub wants to catch edge cases before forcing the switch. If you're running custom search configurations or have unusually large indexes, now is the time to test.
This work also sets the foundation for future search improvements. The old architecture's fragility made it risky to ship new search features—any change could trigger the deadlock scenarios. With a stable replication model, GitHub can iterate faster on search quality, performance, and new surfaces. Similar to how GitHub's agentic workflows required rethinking security architecture, this search rebuild required rethinking the entire replication model.
Try It Yourself
The new CCR mode is available in GitHub Enterprise Server 3.19.1. To enable it, contact support@github.com and request access to the new HA mode. They'll provision a license that unlocks the feature.
Once you have the license, the migration is a config change and a restart:
# Enable CCR mode
ghe-config app.elasticsearch.ccr true
# Apply configuration (or run during upgrade to 3.19.1)
ghe-config-apply
During the restart, Elasticsearch migrates your installation to the new replication method. Monitor the migration logs—large instances with millions of indexed documents will take longer to consolidate data onto the primary node and establish CCR replication.
Test failover scenarios in staging before rolling to production. Verify that search works correctly after promoting a replica to primary. Check that index deletion on the primary correctly removes follower indexes on replicas.
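For those verification steps, stock Elasticsearch exposes follower status through the CCR info and stats APIs. These checks assume direct access to the replica node's cluster and use an illustrative index name:

```shell
# Confirm the index is actually following, and from which leader.
curl -s "http://localhost:9200/issues/_ccr/info"
# Replication progress and fetch errors for all follower indexes.
curl -s "http://localhost:9200/_ccr/stats"
```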
The Bottom Line
Use this if you're running GitHub Enterprise Server in HA mode and tired of babysitting search indexes during maintenance windows. The migration is straightforward, and the operational wins are immediate.
Skip it if you're on a single-node Enterprise Server deployment—CCR only applies to HA setups. Also skip it if you're running a heavily customized Elasticsearch configuration that might conflict with GitHub's CCR workflows. Test in staging first.
The real opportunity here is GitHub finally treating search infrastructure as a solved problem. Years of engineering effort went into patching an incompatible architecture. Now they've rebuilt it correctly, and that engineering capacity can go toward features that actually differentiate the platform. For administrators, that means less time debugging search deadlocks and more time shipping value to developers.
Source: GitHub Blog