@HelloSurgeAI
Everyone’s building $100M "agentic" models, so we built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired. Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., RL environment) with 23 tools, 2500 entities, and enough corporate red tape to make Harvey cry. The best agent in the world (Opus 4.6! 👑) barely scored 30%. The #2 model (GPT-5.2 🥈) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at 🥉) The good news? We trained a model on this chaos and it got better at its job - even translating those skills to other benchmarks. (e.g., +7.4% on Tau2-Bench Retail) Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research! Blog post: https://t.co/GUaXJ8BeP0 Paper: https://t.co/1BMiTUdM66 Leaderboard: https://t.co/UbSx9gmbnX