@mercor_ai
Introducing APEX-SWE, in collaboration with @Cognition. The Cognition team sees firsthand that real software engineering is no longer just writing code. It's deploying systems, integrating with tools, and debugging when things break. On APEX-SWE, no model reliably solves real production software engineering tasks. @OpenAI GPT-5.3 Codex (High) tops the leaderboard at 41.5% Pass@1, followed by @AnthropicAI Opus 4.6 (High) at 40.5%. Even the leading frontier models fail on nearly 60% of real production tasks.