Anthropic's Hiring Test Can't Beat Claude Anymore
Anthropic's performance engineering take-home has been defeated twice by its own Claude models. The team redesigned it three times, eventually abandoning realism for deliberately weird puzzles. Here's what they learned about AI-resistant technical evaluations.
TL;DR
- Anthropic's performance engineering take-home test has been defeated twice by its own models — first Claude Opus 4, then Opus 4.5
- Claude Opus 4.5 now matches the best human performance within the 2-hour time limit, making AI delegation the optimal strategy
- The team redesigned the test three times, eventually abandoning realism for deliberately out-of-distribution puzzles that favor human reasoning over Claude's training data
- The original test is now released as an open challenge — beat Claude's 1487-cycle score and Anthropic wants to hear from you
The Big Picture
Every technical hiring manager faces the same nightmare: you spend weeks designing the perfect take-home test, it works beautifully for months, then your own AI model solves it better than your candidates.
Tristan Hume, lead on Anthropic's performance optimization team, has lived this nightmare three times in 18 months. The take-home test he designed in late 2023 helped hire dozens of engineers who shipped every Claude model since Claude 3 Opus. It was engaging, predictive, and candidates actually enjoyed it. Some worked past the 4-hour limit because they were having fun.
Then Claude Opus 4 beat it. Hume redesigned the test, making it harder and shorter. That worked until Claude Opus 4.5 matched even the strongest human candidates within the 2-hour window. Now the optimal strategy isn't demonstrating your skills — it's sitting back and watching Claude Code work.
This isn't just an Anthropic problem. As models improve, every technical evaluation that worked last year becomes useless this year. The question isn't whether your hiring process will break — it's when, and what you do about it.
How the Original Test Worked
Hume built a Python simulator for a fake accelerator with TPU-like characteristics. Candidates optimize code running on this machine using a hot-reloading Perfetto trace showing every instruction — similar to the tooling Anthropic uses on Trainium.
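Perfetto can open traces in the Chrome Trace Event JSON format, so a simulator can expose its per-instruction timeline by dumping one complete event per instruction. The sketch below is an illustrative assumption about how such a dump might look, not Anthropic's actual tooling; the instruction names and timings are invented placeholders.

```python
import json

# Emit a Chrome Trace Event JSON file that Perfetto can open.
# Each tuple is (name, core, start_us, duration_us); "ph": "X" marks a
# complete event with an explicit duration. Names/timings are invented.
def write_trace(events, path):
    trace = [
        {"name": name, "ph": "X", "ts": start_us, "dur": dur_us,
         "pid": 0, "tid": core}
        for name, core, start_us, dur_us in events
    ]
    with open(path, "w") as f:
        json.dump({"traceEvents": trace}, f)

write_trace([("vload", 0, 0, 2), ("vmul", 0, 2, 4), ("vstore", 0, 6, 2)],
            "sim_trace.json")
```

Re-running this after every code change gives the hot-reloading feedback loop the article describes: edit, re-simulate, refresh the trace in Perfetto.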
The simulated machine includes manually managed scratchpad memory, VLIW instruction packing, SIMD vector operations, and multicore parallelism. The task is a parallel tree traversal inspired by branchless SIMD decision tree inference — deliberately not deep learning flavored, since most candidates hadn't worked on ML yet.
Candidates start with a fully serial implementation and progressively exploit parallelism. The warmup is multicore distribution, then they choose between SIMD vectorization or VLIW instruction packing. The original version included a bug requiring candidates to build debugging tools first.
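The branchless formulation that inspired the task is worth seeing concretely. In a complete tree stored heap-style in flat arrays, every traversal step is pure arithmetic on an index, so all rows advance in lockstep with no data-dependent branches, which is what makes it SIMD-friendly. This is a generic sketch of the technique, not Anthropic's actual test problem:

```python
import numpy as np

def build_tree(depth, n_features, rng):
    """Random complete binary tree in flat arrays (heap layout):
    node i has children 2i+1 and 2i+2; leaves follow the internal nodes."""
    n_internal = 2**depth - 1
    feature = rng.integers(0, n_features, size=n_internal)
    threshold = rng.random(n_internal)
    leaf_value = rng.random(2**depth)
    return feature, threshold, leaf_value

def predict_branchless(X, feature, threshold, leaf_value, depth):
    """Traverse all rows in lockstep: each step computes the child index
    arithmetically (left = 2i+1, right = 2i+2), so the same code
    vectorizes across every row with no branching."""
    idx = np.zeros(len(X), dtype=np.int64)
    for _ in range(depth):
        go_right = X[np.arange(len(X)), feature[idx]] > threshold[idx]
        idx = 2 * idx + 1 + go_right      # bool promotes to 0/1
    return leaf_value[idx - len(feature)]  # heap index -> leaf slot

rng = np.random.default_rng(0)
depth, n_features = 4, 8
feature, threshold, leaf_value = build_tree(depth, n_features, rng)
X = rng.random((5, n_features))
preds = predict_branchless(X, feature, threshold, leaf_value, depth)
```

The serial starter code candidates receive presumably looks like the per-row while-loop version of this; the optimization work is getting from that to something like the vectorized form, and beyond.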
The design philosophy was simple: make it genuinely engaging, representative of real work, and compatible with AI assistance. Longer time horizons are harder for AI to solve completely, so candidates could use AI tools as they would on the job while still needing to demonstrate their own skills.
It worked. The person who scored highest in the first batch started two weeks later and immediately began optimizing kernels. He found a workaround for a launch-blocking compiler bug involving tensor indexing math overflowing 32 bits. Over the next year, about 1,000 candidates completed it. Feedback was overwhelmingly positive.
The First Defeat: Claude Opus 4
By May 2025, Claude 3.7 Sonnet had crept up to where over 50% of candidates would have been better off delegating entirely to Claude Code. Then Hume tested a pre-release Claude Opus 4. It came up with a more optimized solution than almost all humans achieved within the 4-hour limit.
The fix was straightforward. The problem had far more depth than anyone could explore in 4 hours, so Hume used Claude Opus 4 to identify where it started struggling. That became the new starting point for version 2.
He wrote cleaner starter code, added new machine features for more depth, and removed multicore parallelism — which Claude had already solved and which only slowed development loops without adding signal. He also shortened the time limit from 4 hours to 2 hours to reduce scheduling overhead in the pipeline.
Version 2 emphasized clever optimization insights over debugging and code volume. It worked well for several months.
The Second Defeat: Claude Opus 4.5
When Hume tested a pre-release Claude Opus 4.5 checkpoint, he watched Claude Code work on the problem for 2 hours. It solved the initial bottlenecks, implemented all the common micro-optimizations, and met the passing threshold in under an hour.
Then it stopped, convinced it had hit an insurmountable memory bandwidth bottleneck. Most humans reach the same conclusion. But there are clever tricks that exploit the problem structure to work around that bottleneck.
When Hume told Claude the cycle count it was possible to achieve, it thought for a while and found the trick. It debugged, tuned, and implemented further optimizations. By the 2-hour mark, its score matched the best human performance within that time limit — and that human had made heavy use of Claude 4 with steering.
The team verified the result more rigorously in their internal test-time compute harness, confirming that Claude could both beat humans within 2 hours and keep improving with more time. After launch, a generic improvement to the harness produced an even lower cycle count.
Some colleagues suggested banning AI assistance. Hume didn't want to do that. Beyond the enforcement challenges, he believed that since people still play a vital role in the team's work, there had to be some way for candidates to distinguish themselves in a setting with AI, just as they would on the job.
Others suggested raising the bar to "substantially outperform what Claude Code achieves alone." The concern: Claude works fast. Humans typically spend half the 2 hours reading and understanding the problem before they start optimizing. A human trying to steer Claude would likely be constantly behind, understanding what Claude did only after the fact. The dominant strategy might become sitting back and watching.
The Redesign Process
Hume tried a different optimization problem first: an efficient data transposition on 2D TPU registers while avoiding bank conflicts, based on one of the trickier kernel optimizations he'd done at Anthropic. He distilled it into a simpler problem on a simulated machine and had Claude implement the changes in under a day.
Claude Opus 4.5 found a great optimization Hume hadn't even thought of. Through careful analysis, it realized it could transpose the entire computation rather than figuring out how to transpose the data, and it rewrote the whole program accordingly.
Hume patched the problem to remove that approach. Claude then made progress but couldn't find the most efficient solution. It seemed like he had his new problem. But he double-checked using Claude Code's "ultrathink" feature with longer thinking budgets — and it solved it. It even knew the tricks for fixing bank conflicts.
In hindsight, this wasn't the right problem. Engineers across many platforms have struggled with data transposition and bank conflicts, so Claude has substantial training data to draw on. While Hume had found his solution from first principles, Claude could draw on a larger toolbox of experience.
Going Deliberately Weird
Hume needed a problem where human reasoning could win over Claude's larger experience base: something sufficiently out of distribution. Unfortunately, this conflicted with his goal of being recognizably like the job.
He thought about the most unusual optimization problems he'd enjoyed and landed on Zachtronics games. These programming puzzle games use unusual, highly constrained instruction sets that force you to program in unconventional ways. In Shenzhen I/O, programs are split across multiple communicating chips that each hold only about 10 instructions with one or two state registers. Clever optimization often involves encoding state into the instruction pointer or branch flags.
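To make the flavor of those constraints concrete, here is a toy single-accumulator "chip" in the spirit of Shenzhen I/O: at most 10 instructions, one register, and jumps as the only other place state can live. This machine is entirely invented for illustration — it is not Anthropic's puzzle instruction set:

```python
def run(program, inputs, max_steps=1000):
    """Execute a program on a toy single-accumulator chip.
    ops: ("inp",) ("add", k) ("out",) ("jmp", t) ("jnz", t)."""
    if len(program) > 10:
        raise ValueError("chip holds at most 10 instructions")
    acc, pc, out, inp = 0, 0, [], iter(inputs)
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, *arg = program[pc]
        if op == "inp":                  # read next input; halt when exhausted
            nxt = next(inp, None)
            if nxt is None:
                break
            acc = nxt
        elif op == "add":
            acc += arg[0]
        elif op == "out":
            out.append(acc)
        elif op == "jmp" or (op == "jnz" and acc != 0):
            pc = arg[0]                  # extra state lives in the pc itself
            continue
        pc += 1
    return out

# "Add 1 to every input", using 4 of the 10 instruction slots:
prog = [("inp",), ("add", 1), ("out",), ("jmp", 0)]
print(run(prog, [1, 2, 3]))  # [2, 3, 4]
```

Scoring by minimal instruction count turns even trivial tasks like this into optimization puzzles: every slot saved usually requires encoding state into control flow rather than data.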
Hume designed a new take-home consisting of puzzles using a tiny, heavily constrained instruction set, optimizing solutions for minimal instruction count. He implemented one medium-hard puzzle and tested it on Claude Opus 4.5. It failed. He filled out more puzzles and had colleagues verify that people less steeped in the problem than him could still outperform Claude.
Unlike Zachtronics games, he intentionally provided no visualization or debugging tools. The starter code only checks whether solutions are valid. Building debugging tools is part of what's being tested: you can either insert well-crafted print statements or ask a coding model to generate an interactive debugger in a few minutes. Judgment about how to invest in tooling is part of the signal.
Early results are promising: scores correlate well with the caliber of candidates' past work, and one of Hume's most capable colleagues scored higher than any candidate so far.
What This Means for Technical Hiring
The shift from realistic to deliberately weird problems represents a fundamental change in how we evaluate technical skills. Realism used to be a feature — now it's a liability. The original test worked because it resembled real work. The replacement works because it simulates novel work.
Performance engineers at Anthropic still have lots of work to do, but it looks more like tough debugging, systems design, performance analysis, figuring out how to verify correctness, and making Claude's code simpler and more elegant. These things are tough to test objectively without a lot of time or common context.
The problem isn't unique to Anthropic. Hume had designed a live interview question in 2023 specifically because their questions at the time were based around common tasks that early Claude models had lots of knowledge of. He tried to design a question requiring more problem solving skill than knowledge, based on a real but niche problem he'd solved at work. Claude 3 Opus beat part 1 of that question. Claude 3.5 Sonnet beat part 2.
Human experts retain an advantage over current models at sufficiently long time horizons. The fastest human solution ever submitted to the original take-home substantially exceeds what Claude has achieved even with extensive test-time compute. But most companies can't ask candidates to spend days on a take-home.
Try It Yourself
Anthropic released the original take-home for anyone to try with unlimited time. The released version starts from scratch like version 1 but uses version 2's instruction set and single-core design.
Performance benchmarks, measured in clock cycles on the simulated machine (lower is better):
- 2164 cycles — Claude Opus 4 after many hours in the test-time compute harness
- 1790 cycles — Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours
- 1579 cycles — Claude Opus 4.5 after 2 hours in the test-time compute harness
- 1548 cycles — Claude Sonnet 4.5 after well over 2 hours of test-time compute
- 1487 cycles — Claude Opus 4.5 after 11.5 hours in the harness
- 1363 cycles — Claude Opus 4.5 in an improved test-time compute harness after many hours
Download it on GitHub. If you optimize below 1487 cycles, beating Claude's best performance at launch, email performance-recruiting@anthropic.com with your code and a resume.
The Bottom Line
Use AI-assisted take-homes if your role involves tasks longer than a few hours where humans still have an edge. Skip them if you're evaluating skills on problems Claude has seen thousands of times in training — you're just testing who delegates fastest.
The real risk here isn't that AI makes hiring harder. It's that we keep using evaluation methods designed for a world where AI couldn't code. If your take-home worked great last year and you haven't tested it against Claude Opus 4.5, you're probably hiring for who uses AI best, not who codes best. Whether that's what you want is a different question.
Anthropic's solution — deliberately weird, out-of-distribution problems — works for now. But it sacrifices realism for resistance. That's a trade-off most companies will face soon if they haven't already. The question isn't whether your hiring process needs to change. It's whether you'll change it before or after your next model defeats it.
Source: Anthropic