Anthropic's Hiring Test Got Beat by Claude — Twice
Anthropic's performance engineering take-home has been redesigned three times because each new Claude model solved it. The latest version uses deliberately weird puzzles to stay ahead of AI — because realism is now a luxury they can't afford.
TL;DR
- Anthropic's performance engineering take-home test has been redesigned three times because each new Claude model solved it
- Claude Opus 4.5 now matches the best human performance in 2 hours on the original test
- The new version uses deliberately weird, out-of-distribution puzzles to stay ahead of AI capabilities
- If you're a performance engineer who can beat Claude's unlimited-time score, Anthropic wants to hear from you
The Big Picture
Tristan Hume has a problem most companies would envy: the AI his team builds keeps defeating the hiring test he designed to find engineers who can optimize that AI.
Since early 2024, Anthropic's performance engineering team has used a take-home test where candidates optimize code for a simulated accelerator. Over 1,000 people have completed it. Dozens got hired, including engineers who brought up Anthropic's Trainium cluster and shipped every model since Claude 3 Opus.
Then Claude Opus 4 outperformed most human applicants given the same time limit. That was fine — it still let them identify the strongest candidates. But Claude Opus 4.5 matched even those top performers. Under the 2-hour constraint, there was no longer a way to distinguish between elite human output and what the model could produce.
This isn't an abstract problem about AI replacing jobs. It's a concrete engineering challenge: how do you evaluate human technical skill when the AI you're building can solve your evaluation faster than humans can?
Hume's solution reveals something uncomfortable about the future of technical hiring. The new test works not because it resembles real work, but because it's deliberately weird — constrained programming puzzles using tiny instruction sets that force unconventional thinking. Realism, he writes, "may be a luxury we no longer have."
How the Original Test Worked
The take-home was designed around a Python simulator for a fake accelerator with TPU-like characteristics. Candidates optimized a parallel tree traversal using features that make accelerator work interesting: manually managed scratchpad memory, VLIW instruction packing, SIMD vectorization, and multicore distribution.
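The details of the simulator aren't public in this summary, but the flavor of one of those features, VLIW instruction packing, can be sketched as a toy greedy list scheduler. Everything here is illustrative: the 2-wide bundle width, op names, and dependency format are assumptions, not the actual take-home's machine.

```python
def pack_vliw(ops, deps, width=2):
    """Greedily pack ops into VLIW bundles on a hypothetical machine.

    ops:  list of op names
    deps: {op_index: set of op indices it depends on}
    An op may issue only after all its dependencies issued in earlier bundles.
    """
    done = set()          # indices already issued
    bundles = []
    remaining = list(range(len(ops)))
    while remaining:
        bundle = []
        for i in remaining:
            if len(bundle) == width:
                break                      # bundle slots exhausted
            if deps.get(i, set()) <= done:
                bundle.append(i)           # all dependencies satisfied
        if not bundle:
            raise ValueError("dependency cycle")
        for i in bundle:
            remaining.remove(i)
        done |= set(bundle)
        bundles.append([ops[i] for i in bundle])
    return bundles
```

For example, two independent loads pack into one bundle, while the dependent add and store each need their own: `pack_vliw(["load a", "load b", "add", "store"], {2: {0, 1}, 3: {2}})` yields three bundles instead of four serial slots.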
The problem deliberately avoided deep learning flavor. Most performance engineers hadn't worked on ML yet and could learn domain specifics on the job. Instead, it drew from branchless SIMD decision tree inference — a classical ML optimization challenge that only a few candidates had encountered.
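The core idea of branchless decision tree inference is replacing the data-dependent branch at each tree level with arithmetic on the comparison result. A minimal sketch, assuming a complete binary tree stored in flat arrays (the layout and names here are illustrative, not the test's actual code):

```python
def branchless_predict(feature, threshold, leaf, depth, x):
    """Descend a complete binary tree stored in arrays.

    Internal node i splits on x[feature[i]] vs threshold[i];
    its children live at indices 2i+1 and 2i+2.
    """
    node = 0
    for _ in range(depth):
        # The comparison yields 0 or 1, so index arithmetic
        # replaces the usual if/else branch.
        node = 2 * node + 1 + (x[feature[node]] > threshold[node])
    # Map the final node index into the leaf array.
    return leaf[node - (2 ** depth - 1)]
```

Because the loop body contains no branches, every input follows the exact same instruction sequence, so a whole batch of inputs can descend the tree in lockstep. That is what makes the technique SIMD-friendly.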
Candidates started with a fully serial implementation and progressively exploited parallelism. The warmup was multicore. Then they chose between SIMD vectorization or VLIW instruction packing. The original version included a bug that required building debugging tools to find.
The format had advantages over live interviews. Four hours (later reduced to two) better reflected real work than 50-minute sessions. Candidates worked in their own environment without someone watching. They had time to understand systems and build tooling — both hard to evaluate in standard interviews.
Critically, candidates could use AI assistance. Anthropic's guidance explicitly allowed it. Longer-horizon problems are harder for AI to solve completely, so candidates could use tools like they would on the job while still needing to demonstrate their own skills.
It worked. The highest scorer from the first batch started in February 2024 and immediately began optimizing kernels. He found a workaround for a launch-blocking compiler bug involving tensor indexing math overflowing 32 bits. The test proved predictive.
Over the next year and a half, it helped hire most of Anthropic's current performance engineering team. It was especially valuable for candidates with limited paper credentials — several top performers came straight from undergrad but showed enough skill to hire confidently.
Many candidates worked past the time limit because they were enjoying it. The strongest unlimited-time submissions included full optimizing mini-compilers and clever optimizations Hume hadn't anticipated.
What This Changes For Developers
By May 2025, Claude 3.7 Sonnet had reached the point where over 50% of candidates would have been better off delegating entirely to Claude Code. Then a pre-release version of Claude Opus 4 produced a more optimized solution than almost any human had achieved in 4 hours.
The fix was straightforward. The problem had more depth than anyone could explore in 4 hours, so Hume used Claude Opus 4 to identify where it started struggling. That became the new starting point for version 2. He wrote cleaner starter code, added new machine features for more depth, and removed multicore (which Claude had already solved).
He also shortened the time limit to 2 hours. The original 4-hour window caused multi-week scheduling delays. Two hours fits into a weekend.
Version 2 emphasized clever optimization insights over debugging and code volume. It worked for several months. Then Claude Opus 4.5 defeated it.
Hume watched Claude Code work on the problem for 2 hours, gradually improving its solution. It solved initial bottlenecks, implemented common micro-optimizations, and met the passing threshold in under an hour. Then it stopped, convinced it had hit an insurmountable memory bandwidth bottleneck.
Most humans reach the same conclusion. But there are clever tricks that exploit problem structure to work around that bottleneck. When Hume told Claude the cycle count it was possible to achieve, it thought for a while and found the trick. It debugged, tuned, and implemented further optimizations. By the 2-hour mark, its score matched the best human performance — and that human had made heavy use of Claude 4 with steering.
For more rigor, Anthropic ran the problem through its internal test-time compute harness, which confirmed that Claude could both beat humans within 2 hours and keep improving given more time. After launch, the team made generic improvements to the harness and got an even higher score.
Some colleagues suggested banning AI assistance. Hume didn't want to do this. Beyond the difficulty of enforcement, he reasoned that since people still play a vital role in the work, there should be some way for candidates to distinguish themselves while using AI, just as they would on the job.
Others suggested raising the bar to "substantially outperform what Claude Code achieves alone." The concern: Claude works fast. Humans typically spend half the 2 hours reading and understanding before they start optimizing. A human steering Claude would likely be constantly behind, understanding what Claude did only after the fact. The dominant strategy might become sitting back and watching.
Performance engineers at Anthropic still have lots of work to do, but it looks more like tough debugging, systems design, performance analysis, figuring out how to verify correctness, and making Claude's code simpler and more elegant. These things are tough to test objectively without lots of time or common context.
The Redesign Process
Hume's first attempt was a different optimization problem based on one of the trickier kernel optimizations he'd done at Anthropic: efficient data transposition on 2D TPU registers while avoiding bank conflicts. He distilled it into a simpler problem on a simulated machine and had Claude implement the changes in under a day.
Claude Opus 4.5 found a great optimization he hadn't even thought of. Through careful analysis, it realized it could transpose the entire computation rather than figuring out how to transpose the data. It rewrote the whole program accordingly.
In the real case, this wouldn't have worked, so Hume patched the problem to remove that approach. Claude then made progress but couldn't find the most efficient solution. It seemed like he had his new problem. But he double-checked using Claude Code's "ultrathink" feature with longer thinking budgets. It solved it. It even knew the tricks for fixing bank conflicts.
In hindsight, this wasn't the right problem. Engineers across many platforms have struggled with data transposition and bank conflicts, so Claude has substantial training data to draw on. While Hume had found his solution from first principles, Claude could draw on a larger toolbox of experience.
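The bank-conflict problem that made this version too learnable is easy to demonstrate. In a simplified memory model where a word's bank is its address modulo the number of banks (as on many GPUs and accelerators), a square tile whose row pitch equals the bank count puts every element of a column in the same bank; the classic fix is padding the pitch by one word. A sketch of that model (the parameters are illustrative, not the test's machine):

```python
from collections import Counter

def bank_conflicts(rows, stride, banks=32):
    """Worst-case same-bank collisions when `rows` lanes each read
    one element of a column; `stride` is the row pitch in words.
    Simplified model: bank = word_address % banks.
    """
    hits = Counter((r * stride) % banks for r in range(rows))
    return max(hits.values())

# 32-row tile with pitch 32: every column element lands in one bank.
worst = bank_conflicts(32, stride=32)   # 32-way conflict
# Pad the pitch by one word and the accesses spread across all banks.
padded = bank_conflicts(32, stride=33)  # conflict-free
```

Because this trick appears in optimization guides for many platforms, it is exactly the kind of well-documented pattern Claude can recall rather than rediscover.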
He needed a problem where human reasoning could win over Claude's larger experience base: something sufficiently out of distribution. Unfortunately, this conflicted with his goal of being recognizably like the job.
He thought about the most unusual optimization problems he'd enjoyed and landed on Zachtronics games. These programming puzzle games use unusual, highly constrained instruction sets that force you to program in unconventional ways. In Shenzhen I/O, programs are split across multiple communicating chips that each hold only about 10 instructions with one or two state registers. Clever optimization often involves encoding state into the instruction pointer or branch flags.
The new take-home consists of puzzles using a tiny, heavily constrained instruction set, optimizing solutions for minimal instruction count. He implemented one medium-hard puzzle and tested it on Claude Opus 4.5. It failed. He filled out more puzzles and had colleagues verify that people less steeped in the problem could still outperform Claude.
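The actual instruction set isn't disclosed in this summary, but the flavor of a tiny constrained machine can be sketched as a two-instruction accumulator interpreter. This is a hypothetical ISA in the Zachtronics spirit, not the take-home's:

```python
def run(program, acc=0, max_steps=100):
    """Interpret a hypothetical accumulator machine with two instructions:
    ("add", n) adds n to acc; ("jnz", t) jumps to index t if acc != 0.
    """
    pc = 0
    while 0 <= pc < len(program) and max_steps > 0:
        op, arg = program[pc]
        if op == "add":
            acc += arg
            pc += 1
        else:  # "jnz"
            pc = arg if acc != 0 else pc + 1
        max_steps -= 1
    return acc

# A countdown loop in three instructions: the program counter itself
# carries the "which phase am I in" state, since there is nowhere
# else to put it.
countdown = [("add", 3), ("add", -1), ("jnz", 1)]
```

With so few registers, minimizing instruction count forces exactly the kind of trick Hume describes: folding state into where you are in the program rather than into data.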
Unlike Zachtronics games, he intentionally provided no visualization or debugging tools. The starter code only checks whether solutions are valid. Building debugging tools is part of what's being tested: you can either insert well-crafted print statements or ask a coding model to generate an interactive debugger in a few minutes. Judgment about how to invest in tooling is part of the signal.
Early results are promising. Scores correlate well with the caliber of candidates' past work. One of Hume's most capable colleagues scored higher than any candidate so far.
He's still sad to have given up the realism and varied depth of the original. But realism may be a luxury we no longer have. The original worked because it resembled real work. The replacement works because it simulates novel work.
Try It Yourself
Anthropic released the original take-home for anyone to try with unlimited time. Human experts retain an advantage over current models at sufficiently long time horizons. The fastest human solution ever submitted substantially exceeds what Claude has achieved even with extensive test-time compute.
The released version starts from scratch (like version 1) but uses version 2's instruction set and single-core design, so cycle counts are comparable to version 2.
Performance benchmarks measured in clock cycles from the simulated machine:
- 2164 cycles — Claude Opus 4 after many hours in the test-time compute harness
- 1790 cycles — Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours
- 1579 cycles — Claude Opus 4.5 after 2 hours in the test-time compute harness
- 1548 cycles — Claude Sonnet 4.5 after many more than 2 hours of test-time compute
- 1487 cycles — Claude Opus 4.5 after 11.5 hours in the harness
- 1363 cycles — Claude Opus 4.5 in an improved test-time compute harness after many hours
Download it on GitHub. If you optimize below 1487 cycles, beating Claude's best performance at launch, email performance-recruiting@anthropic.com with your code and a resume.
Or apply through Anthropic's typical process, which uses their (now) Claude-resistant take-home.
The Bottom Line
Use this test if you're a performance engineer who wants to prove you can outthink Claude at optimization problems. Skip it if you're looking for a realistic preview of day-to-day work — that's no longer what the test provides.
The real risk here isn't that AI will replace performance engineers. Anthropic still needs more of them, and the work remains challenging. The risk is that we're losing the ability to evaluate technical skill through realistic simulations of the job. When your hiring test needs to be deliberately weird to stay ahead of your own models, you're admitting that AI can already handle the straightforward parts.
The opportunity is different: if you can beat 1363 cycles with unlimited time, you're demonstrating something Claude still can't do. That's worth more than any credential on paper. The question is how long that advantage lasts.
Source: Anthropic