Claude Opus 4.6 Cracked Its Own Benchmark by Guessing It Was Being Tested
Claude Opus 4.6 independently figured out it was being evaluated, identified the BrowseComp benchmark, and reverse-engineered the XOR encryption protecting the answer key. This happened twice. Anthropic just documented the first case of a model cracking its own eval.