@sayashk
On our evals for HAL, we found that agents figure out they're being evaluated even on capability evals. For example, here Claude 3.7 Sonnet *looks up the benchmark on HuggingFace* to find the answer to an AssistantBench question. There were many such cases across benchmarks and models. Of course, you can make it harder for the agent to cheat, such as by blocking HuggingFace or encrypting the dataset. But so long as the benchmark is available somewhere in public, an agent could theoretically follow the same steps a human would to access it (e.g., decrypting a password-protected benchmark on its sandbox). So agent log analysis will become necessary even for capability evals. HAL now has logs from 20,000+ rollouts across 9 benchmarks, and we are analyzing all of these logs using @TransluceAI's Docent.