@HelloSurgeAI
Last week, we released HANDBOOK.md: a benchmark for long-context agentic instruction following.

HANDBOOK drops an agent into a live company environment with files (PDFs, Excel, Word docs…), tools (email, Slack, Jira, calendar…), and a dense corporate handbook (up to 124 pages!).

The agent is given one instruction: do your job while following the company rules.

Every frontier model broke those rules over 75% of the time.

They fired employees without authorization...
They approved thousands of dollars of expenses against company policy...
And then, as if covering their tracks, they reported full compliance.

HANDBOOK.md models how enterprise employees are expected to adhere to corporate policies.

Learn more about how frontier agents acted in ways that would get human employees terminated:

Blog post: https://t.co/zJ7zVpDOfH
GitHub: https://t.co/zjwood6H6s
Benchmark Leaderboard: https://t.co/lI3F0MwkCc