🐦 Twitter Post Details

Viewing enriched Twitter post

@HelloSurgeAI

Let’s look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle to solve tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.

CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and adapt to newly discovered context and problems along the way.

Even top models failed >70% of the time. Let’s dive into a failure 🧵

One task was standard customer support:

A customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement.

The prompt specifically asked for the "most popular" replacement:

"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one, (based on the number of fulfilled orders containing each motherboard from the last 2 months)."

The catch: to find the "most popular" item, the agent must query a production DB of historical orders to count item frequencies.

The constraint: the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.

❌ GPT-5.2 failed

GPT-5.2 showed strong initial planning. It successfully

✅ navigated the CRM
✅ found the right order
✅ checked that the delivery date was still within the return window
✅ searched for alternative boards
✅ checked whether they were compatible with Aiden’s other components.

💀 But then it hit the pagination ceiling. It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.
In its hidden reasoning, GPT-5.2 actually noticed the problem:

"All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."

Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe.

Instead of pivoting, it concluded the task was impossible. Like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.

GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report."

In other words, "I’m an advanced autonomous agent, but can you go bother Aisha about this?"

✅ Claude Opus 4.6

So was the task really impossible? No.

Claude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution:

"I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured."

The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10.

Opus just kept tightening the time window sequentially and eventually succeeded.

✅ Gemini 3.1 Pro

Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach:

"I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count."

Overall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the very first sign of trouble.

Opus and Gemini acted like senior devs who know APIs have limits you must engineer around.

That said – Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!
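Opus’s sequential recovery – treat the oldest createdAt in each batch as a cursor and keep tightening the window until a batch comes back short – can be sketched as follows. The searchOrders name comes from the thread; the signature, the in-memory stand-in data, and the count_orders helper are illustrative assumptions, not the benchmark’s real API:

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for CoreCraft's order DB: 37 fulfilled
# orders for one board, enough to force four paginated calls at limit=10.
_ORDERS = [
    {"item_id": "mobo-a", "createdAt": datetime(2026, 1, 1) + timedelta(hours=i)}
    for i in range(37)
]

def search_orders(item_id, created_before=None, limit=10):
    """Mimics the searchOrders tool: newest-first results, hard 10-result cap."""
    hits = [o for o in _ORDERS if o["item_id"] == item_id]
    if created_before is not None:
        hits = [o for o in hits if o["createdAt"] < created_before]
    hits.sort(key=lambda o: o["createdAt"], reverse=True)
    return hits[:limit]

def count_orders(item_id):
    """Page past the cap: the oldest createdAt in each batch is a free cursor."""
    total, cursor = 0, None
    while True:
        batch = search_orders(item_id, created_before=cursor)
        total += len(batch)
        if len(batch) < 10:  # a short (or empty) batch means nothing is left
            return total
        cursor = batch[-1]["createdAt"]  # tighten the window and query again
```

The key move is exactly what GPT-5.2 declined to make: a full batch is evidence of truncation, so the loop continues until a partial batch proves the window is exhausted.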
🥇 OpenAI -- GPT-5.2 (xHigh reasoning)
🥈 Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
🥉 OpenAI -- GPT-5.2 (High reasoning)
4️⃣ Google -- Gemini 3.1 Pro

We’ll dive into other agentic failure patterns in subsequent threads (follow along!)

Read more about EnterpriseBench and CoreCraft:

Blog post - https://t.co/GUaXJ8BeP0
Paper - https://t.co/hUmkc8LDmq
Leaderboard - https://t.co/UbSx9gmbnX
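Gemini’s parallel divide-and-conquer from the thread can be sketched in the same spirit: recursively halve any date window that maxes out the cap, and count the two halves concurrently. As above, the data and function signatures are hypothetical stand-ins, not the benchmark’s real tool:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

# Hypothetical stand-in data: 40 fulfilled orders spaced 3 hours apart.
_ORDER_TIMES = [datetime(2026, 1, 1) + timedelta(hours=3 * i) for i in range(40)]

def search_orders(start, end, limit=10):
    """Mimics the capped tool: at most 10 order timestamps inside [start, end)."""
    return sorted(t for t in _ORDER_TIMES if start <= t < end)[:limit]

def count_in_window(start, end):
    """If a window maxes out the cap, halve it and count both halves in parallel."""
    hits = search_orders(start, end)
    if len(hits) < 10:  # under the cap, so this count is exact
        return len(hits)
    mid = start + (end - start) / 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(count_in_window, start, mid)
        right = pool.submit(count_in_window, mid, end)
        return left.result() + right.result()
```

Since the two half-windows partition the original, the subtotals sum to an exact count; a full batch of 10 is always treated as possibly truncated and split further.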

Media 1

📊 Media Metadata

{
  "score": 0.46,
  "score_components": {
    "author": 0.09,
    "engagement": 0.0,
    "quality": 0.16,
    "source": 0.135,
    "nlp": 0.05,
    "recency": 0.025
  },
  "scored_at": "2026-03-01T12:10:45.044487",
  "import_source": "api_import",
  "source_tagged_at": "2026-03-01T12:10:45.044511",
  "enriched": true,
  "enriched_at": "2026-03-01T12:10:45.044514",
  "media": [
    {
      "type": "photo",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2027129627471421733/media_0.png?",
      "filename": "media_0.png"
    }
  ]
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2027129627471421733",
  "url": "https://x.com/HelloSurgeAI/status/2027129627471421733",
  "twitterUrl": "https://twitter.com/HelloSurgeAI/status/2027129627471421733",
  "text": "Let’s look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle at solving tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.\n\nCoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool-use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and react and adjust to newly discovered context and problems along the way.\n\nEven top models failed >70% of the time. Let’s dive into a failure 🧵\n\nOne task was standard customer support:\n\nA customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement.\n\nThe prompt specifically asked for the \"most popular\" replacement:\n\n\"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one, (based on the number of fulfilled orders containing each motherboard from the last 2 months).\"\n\nThe catch - to find the \"most popular\" item, the agent must query a production DB of historical orders to count item frequencies.\n\nThe constraint - the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.\n\n❌ GPT-5.2 failed\n\nGPT-5.2 showed strong initial planning. It successfully \n\n✅ navigated the CRM\n✅ found the right order\n✅ checked the delivery date to see if it was still within the return window\n✅ searched for alternative boards\n✅ checked whether they were compatible with Aiden’s other components.\n\n💀 But then it hit the pagination’s ceiling. 
It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.\n\nIn its hidden reasoning, GPT-5.2 actually noticed the problem:\n\n\"All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity.\"\n\nDid it write a pagination loop? No. It treated limit=10 as a physical law of the universe.\n\nInstead of pivoting, it concluded the task was impossible. Like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.\n\nGPT-5.2's final output: \"The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report.\"\n\nIn other words, \"I’m an advanced autonomous agent, but can you go bother Aisha about this?\"\n\n✅ Claude Opus 4.6 \n\nSo was the task really impossible? No. \n\nClaude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution:\n\n\"I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured.\"\n\nThe database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. \n\nOpus just kept tightening the time window sequentially and eventually succeeded.\n\n✅ Gemini 3.1 Pro\n\nGemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach:\n\n\"I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count.\"\n\n--\n--\n--\n\nOverall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the very first sign of trouble. 
\n\nOpus and Gemini acted like senior devs who know APIs have limits you must engineer around.\n\nThat said – Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all! \n\n🥇 OpenAI -- GPT-5.2 (xHigh reasoning)\n🥈 Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)\n🥉 OpenAI -- GPT-5.2 (High reasoning)\n4️⃣ Google -- Gemini 3.1 Pro\n\nWe’ll dive into other agentic failure patterns in subsequent threads (follow along!)\n\nRead more about EnterpriseBench and CoreCraft:\n\nBlog post - https://t.co/GUaXJ8BeP0\nPaper - https://t.co/hUmkc8LDmq\nLeaderboard - https://t.co/UbSx9gmbnX",
  "source": "Twitter for iPhone",
  "retweetCount": 3,
  "replyCount": 1,
  "likeCount": 18,
  "quoteCount": 1,
  "viewCount": 1444,
  "createdAt": "Thu Feb 26 21:12:20 +0000 2026",
  "lang": "en",
  "bookmarkCount": 9,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2027129627471421733",
  "displayTextRange": [
    0,
    268
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "HelloSurgeAI",
    "url": "https://x.com/HelloSurgeAI",
    "twitterUrl": "https://twitter.com/HelloSurgeAI",
    "id": "1267866160894222343",
    "name": "Surge AI",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1992439362009645056/itZea2R1_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/1267866160894222343/1763868703",
    "description": "",
    "location": "",
    "followers": 8040,
    "following": 142,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Tue Jun 02 17:10:41 +0000 2020",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 257,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 192,
    "statusesCount": 664,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1681343766123143168"
    ],
    "profile_bio": {
      "description": "Our mission is to raise AGI with the richness of humanity — curious, witty, imaginative, and full of breathtaking brilliance.",
      "entities": {
        "description": {
          "hashtags": [],
          "symbols": [],
          "urls": [],
          "user_mentions": []
        },
        "url": {
          "urls": [
            {
              "display_url": "surgehq.ai",
              "expanded_url": "https://www.surgehq.ai",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/6bGF7OxrIX"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {},
  "card": {
    "binding_values": [
      {
        "key": "photo_image_full_size_large",
        "value": {
          "image_value": {
            "height": 419,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=800x419",
            "width": 800
          }
        }
      },
      {
        "key": "thumbnail_image",
        "value": {
          "image_value": {
            "height": 150,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=280x150",
            "width": 225
          }
        }
      },
      {
        "key": "description",
        "value": {
          "string_value": "Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks. Our goal: to move agents beyond the cleanliness..."
        }
      },
      {
        "key": "domain",
        "value": {
          "string_value": "surgehq.ai"
        }
      },
      {
        "key": "thumbnail_image_large",
        "value": {
          "image_value": {
            "height": 320,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=800x320_1",
            "width": 480
          }
        }
      },
      {
        "key": "summary_photo_image_small",
        "value": {
          "image_value": {
            "height": 202,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=386x202",
            "width": 386
          }
        }
      },
      {
        "key": "thumbnail_image_original",
        "value": {
          "image_value": {
            "height": 1024,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=orig",
            "width": 1536
          }
        }
      },
      {
        "key": "photo_image_full_size_small",
        "value": {
          "image_value": {
            "height": 202,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=386x202",
            "width": 386
          }
        }
      },
      {
        "key": "summary_photo_image_large",
        "value": {
          "image_value": {
            "height": 419,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=800x419",
            "width": 800
          }
        }
      },
      {
        "key": "thumbnail_image_small",
        "value": {
          "image_value": {
            "height": 67,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=100x100",
            "width": 100
          }
        }
      },
      {
        "key": "thumbnail_image_x_large",
        "value": {
          "image_value": {
            "height": 1024,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=png&name=2048x2048_2_exp",
            "width": 1536
          }
        }
      },
      {
        "key": "photo_image_full_size_original",
        "value": {
          "image_value": {
            "height": 1024,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=orig",
            "width": 1536
          }
        }
      },
      {
        "key": "vanity_url",
        "value": {
          "scribe_key": "vanity_url",
          "string_value": "surgehq.ai"
        }
      },
      {
        "key": "photo_image_full_size",
        "value": {
          "image_value": {
            "height": 314,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=600x314",
            "width": 600
          }
        }
      },
      {
        "key": "thumbnail_image_color",
        "value": {
          "image_color_value": {
            "palette": [
              {
                "percentage": 33.18,
                "rgb": {
                  "blue": 50,
                  "green": 33,
                  "red": 42
                }
              },
              {
                "percentage": 12.77,
                "rgb": {
                  "blue": 67,
                  "green": 70,
                  "red": 126
                }
              },
              {
                "percentage": 7.4,
                "rgb": {
                  "blue": 96,
                  "green": 149,
                  "red": 231
                }
              },
              {
                "percentage": 6.05,
                "rgb": {
                  "blue": 91,
                  "green": 44,
                  "red": 44
                }
              },
              {
                "percentage": 3.78,
                "rgb": {
                  "blue": 158,
                  "green": 218,
                  "red": 250
                }
              }
            ]
          }
        }
      },
      {
        "key": "title",
        "value": {
          "string_value": "EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic, Enterprise RL Environments"
        }
      },
      {
        "key": "summary_photo_image_color",
        "value": {
          "image_color_value": {
            "palette": [
              {
                "percentage": 33.18,
                "rgb": {
                  "blue": 50,
                  "green": 33,
                  "red": 42
                }
              },
              {
                "percentage": 12.77,
                "rgb": {
                  "blue": 67,
                  "green": 70,
                  "red": 126
                }
              },
              {
                "percentage": 7.4,
                "rgb": {
                  "blue": 96,
                  "green": 149,
                  "red": 231
                }
              },
              {
                "percentage": 6.05,
                "rgb": {
                  "blue": 91,
                  "green": 44,
                  "red": 44
                }
              },
              {
                "percentage": 3.78,
                "rgb": {
                  "blue": 158,
                  "green": 218,
                  "red": 250
                }
              }
            ]
          }
        }
      },
      {
        "key": "summary_photo_image_x_large",
        "value": {
          "image_value": {
            "height": 1024,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=png&name=2048x2048_2_exp",
            "width": 1536
          }
        }
      },
      {
        "key": "summary_photo_image",
        "value": {
          "image_value": {
            "height": 314,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=600x314",
            "width": 600
          }
        }
      },
      {
        "key": "photo_image_full_size_color",
        "value": {
          "image_color_value": {
            "palette": [
              {
                "percentage": 33.18,
                "rgb": {
                  "blue": 50,
                  "green": 33,
                  "red": 42
                }
              },
              {
                "percentage": 12.77,
                "rgb": {
                  "blue": 67,
                  "green": 70,
                  "red": 126
                }
              },
              {
                "percentage": 7.4,
                "rgb": {
                  "blue": 96,
                  "green": 149,
                  "red": 231
                }
              },
              {
                "percentage": 6.05,
                "rgb": {
                  "blue": 91,
                  "green": 44,
                  "red": 44
                }
              },
              {
                "percentage": 3.78,
                "rgb": {
                  "blue": 158,
                  "green": 218,
                  "red": 250
                }
              }
            ]
          }
        }
      },
      {
        "key": "photo_image_full_size_x_large",
        "value": {
          "image_value": {
            "height": 1024,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=png&name=2048x2048_2_exp",
            "width": 1536
          }
        }
      },
      {
        "key": "card_url",
        "value": {
          "scribe_key": "card_url",
          "string_value": "https://t.co/GUaXJ8BeP0"
        }
      },
      {
        "key": "summary_photo_image_original",
        "value": {
          "image_value": {
            "height": 1024,
            "url": "https://pbs.twimg.com/card_img/2027134585776332800/hD1H6yt9?format=jpg&name=orig",
            "width": 1536
          }
        }
      }
    ],
    "card_platform": {
      "platform": {
        "audience": {
          "name": "production"
        },
        "device": {
          "name": "iPhone",
          "version": "13"
        }
      }
    },
    "name": "summary_large_image",
    "url": "https://t.co/GUaXJ8BeP0",
    "user_refs_results": []
  },
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [
      {
        "display_url": "surgehq.ai/blog/enterpris…",
        "expanded_url": "https://surgehq.ai/blog/enterprisebench-corecraft",
        "indices": [
          4246,
          4269
        ],
        "url": "https://t.co/GUaXJ8BeP0"
      },
      {
        "display_url": "arxiv.org/abs/2602.16179",
        "expanded_url": "https://arxiv.org/abs/2602.16179",
        "indices": [
          4278,
          4301
        ],
        "url": "https://t.co/hUmkc8LDmq"
      },
      {
        "display_url": "surgehq.ai/leaderboards/e…",
        "expanded_url": "https://surgehq.ai/leaderboards/enterprisebench-corecraft",
        "indices": [
          4316,
          4339
        ],
        "url": "https://t.co/UbSx9gmbnX"
      }
    ],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "article": null
}