@echen
Everyone’s building $100M "agentic" models, so we @HelloSurgeAI built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., RL environment) with 23 tools, 2500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6! 👑) scored under 30%.

The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter.

Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉)

My favorite: GPT-5.2 spent 11 tool calls curating a promotional email to help a customer reach Platinum tier... a tier she was already in. "Here are 3 items over $0 you can buy!"

"We would obviously never run ads in the way Anthropic depicts them...." -- thanks Sam.

The good news? We trained a model on this chaos and it got better at its job - even translating those skills to other benchmarks. (e.g., +7.4% on Tau2-Bench Retail)

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!

Blog post: https://t.co/mv4I1dCtOC
Paper: https://t.co/EaOHmExm1r
Leaderboard: https://t.co/7fb6fewGIQ