🐦 Twitter Post Details

@sukh_saroy

New research just exposed the biggest lie in AI coding benchmarks.

LLMs score 84-89% on standard coding tests.

On real production code? 25-34%.

That's not a gap. That's a different reality.

Here's what happened:

Researchers built a benchmark from actual open-source repositories: real classes with real dependencies, real type systems, real integration complexity. Then they tested the same models that dominate HumanEval leaderboards.

The results were brutal.

The models weren't failing because the code was "harder." They were failing because it was *real*. Synthetic benchmarks test whether a model can write a self-contained function with a clean docstring. Production code requires understanding inheritance hierarchies, framework integrations, and project-specific utilities.

Different universe. Same leaderboard score.

But it gets worse.

A separate study ran 600,000 debugging experiments across 9 LLMs. They took a program with a known bug. The LLM found it. Then they renamed a variable. Added a comment. Shuffled function order. Changed nothing about the bug itself.

The LLM couldn't find the same bug anymore.

78% of the time, cosmetic changes that don't affect program behavior completely broke the model's ability to debug. Function shuffling alone reduced debugging accuracy by 83%.

The models aren't reading code. They're pattern-matching against what code *looks like* in their training data.

A third study confirmed this from another angle: when researchers obfuscated real-world code (changing symbols, structure, and semantics while keeping functionality identical), LLM pass rates dropped by up to 62.5%.

The researchers call this the "Specialist in Familiarity" problem. LLMs perform well on code they've memorized. The moment you show them something unfamiliar with the same logic, they collapse.

Three papers. Three different methodologies. Same conclusion: the benchmarks we use to evaluate AI coding tools are measuring memorization, not understanding.

If you're shipping LLM-generated code into production without review, these numbers should concern you.

If you're building developer tools, the question isn't "what's your HumanEval score?" It's "what happens when the code doesn't look like the training data?"
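
The cosmetic edits described in that debugging study are easy to make concrete. Below is a minimal sketch of my own (not code from any of the three papers; the second_largest function and its bug are invented for illustration) that uses Python's ast module to apply two of the described perturbations, variable renaming and function shuffling, and then confirms the planted bug behaves identically afterward. (Comment insertion, the third perturbation, can't be shown this way, since ast discards comments.)

import ast
import random

class RenameLocals(ast.NodeTransformer):
    """Rewrite parameter and local-variable names to opaque ones (v0, v1, ...)."""
    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node):  # function parameters
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        # Rename names we bind (Store context) and later reads of them;
        # anything else (builtins, imports) keeps its original spelling.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._fresh(node.id)
        return node

SRC = '''
def second_largest(nums):
    best, runner = nums[0], nums[0]
    for n in nums[1:]:
        if n > best:
            runner, best = best, n
        elif n > runner:
            runner = n
    return runner

def helper_a():
    return 1

def helper_b():
    return 2
'''

tree = ast.parse(SRC)
tree = RenameLocals().visit(tree)  # cosmetic change 1: rename variables
random.shuffle(tree.body)          # cosmetic change 2: reorder top-level functions
perturbed = ast.unparse(tree)      # requires Python 3.9+
print(perturbed)

# The bug (runner is initialised to nums[0] instead of below any element)
# survives both edits: the perturbed code computes the same wrong answer.
ns = {}
exec(perturbed, ns)
assert ns["second_largest"]([5, 3]) == 5  # should be 3; still buggy after perturbation

Every transformation above is provably behavior-preserving, which is exactly the study's point: a reader tracking data flow rather than surface form should find the bug equally well in either version.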

Media 1

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2028155568528257218/media_0.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2028155568528257218/media_0.jpg?",
      "type": "photo",
      "filename": "media_0.jpg"
    }
  ],
  "processed_at": "2026-03-02T19:10:44.321350",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2028155568528257218",
  "url": "https://x.com/sukh_saroy/status/2028155568528257218",
  "twitterUrl": "https://twitter.com/sukh_saroy/status/2028155568528257218",
  "text": "New research just exposed the biggest lie in AI coding benchmarks.\n\nLLMs score 84-89% on standard coding tests.\n\nOn real production code? 25-34%.\n\nThat's not a gap. That's a different reality.\n\nHere's what happened:\n\nResearchers built a benchmark from actual open-source repositories real classes with real dependencies, real type systems, real integration complexity.\n\nThen they tested the same models that dominate HumanEval leaderboards.\n\nThe results were brutal.\n\nThe models weren't failing because the code was \"harder.\" They were failing because it was *real*. Synthetic benchmarks test whether a model can write a self-contained function with a clean docstring. Production code requires understanding inheritance hierarchies, framework integrations, and project-specific utilities.\n\nDifferent universe. Same leaderboard score.\n\nBut it gets worse.\n\nA separate study ran 600,000 debugging experiments across 9 LLMs. They found a bug in a program. The LLM found it too. Then they renamed a variable. Added a comment. Shuffled function order. Changed nothing about the bug itself.\n\nThe LLM couldn't find the same bug anymore.\n\n78% of the time, cosmetic changes that don't affect program behavior completely broke the model's ability to debug.\n\nFunction shuffling alone reduced debugging accuracy by 83%.\n\nThe models aren't reading code. They're pattern-matching against what code *looks like* in their training data.\n\nA third study confirmed this from another angle: when researchers obfuscated real-world code changing symbols, structure, and semantics while keeping functionality identical LLM pass rates dropped by up to 62.5%.\n\nThe researchers call this the \"Specialist in Familiarity\" problem. LLMs perform well on code they've memorized. The moment you show them something unfamiliar with the same logic, they collapse.\n\nThree papers. Three different methodologies. Same conclusion:\n\nThe benchmarks we use to evaluate AI coding tools are measuring memorization, not understanding.\n\nIf you're shipping code generated by LLMs into production without review, these numbers should concern you.\n\nIf you're building developer tools, the question isn't \"what's your HumanEval score.\" It's \"what happens when the code doesn't look like the training data.\"",
  "source": "Twitter for iPhone",
  "retweetCount": 170,
  "replyCount": 68,
  "likeCount": 628,
  "quoteCount": 30,
  "viewCount": 69153,
  "createdAt": "Sun Mar 01 17:09:03 +0000 2026",
  "lang": "en",
  "bookmarkCount": 390,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2028155568528257218",
  "displayTextRange": [
    0,
    270
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "sukh_saroy",
    "url": "https://x.com/sukh_saroy",
    "twitterUrl": "https://twitter.com/sukh_saroy",
    "id": "1912650940555137025",
    "name": "Sukh Sroay",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/2026165347762774016/iuzhai8I_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/1912650940555137025/1746292109",
    "description": "",
    "location": "Edmonton",
    "followers": 8624,
    "following": 321,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Wed Apr 16 23:35:02 +0000 2025",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 3027,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 807,
    "statusesCount": 3098,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "2027822875730931774"
    ],
    "profile_bio": {
      "description": "Sharing what's wild and what's practical ways to grow your business using tech, AI, and robotics • DM for Collabs",
      "entities": {
        "description": {
          "hashtags": [],
          "symbols": [],
          "urls": [],
          "user_mentions": []
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.twitter.com/o8GzEqqZEu",
        "expanded_url": "https://twitter.com/sukh_saroy/status/2028155568528257218/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": []
          },
          "orig": {
            "faces": []
          }
        },
        "id_str": "2028155563717316608",
        "indices": [
          271,
          294
        ],
        "media_key": "3_2028155563717316608",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARwldL1UmlAACgACHCV0vnNbcMIAAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABHCV0vVSaUAAKAAIcJXS+c1twwgAA",
            "media_key": "3_2028155563717316608"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/HCV0vVSaUAAK1uQ.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 1428,
              "w": 2550,
              "x": 0,
              "y": 0
            },
            {
              "h": 2550,
              "w": 2550,
              "x": 0,
              "y": 0
            },
            {
              "h": 2907,
              "w": 2550,
              "x": 0,
              "y": 0
            },
            {
              "h": 3300,
              "w": 1650,
              "x": 742,
              "y": 0
            },
            {
              "h": 3300,
              "w": 2550,
              "x": 0,
              "y": 0
            }
          ],
          "height": 3300,
          "width": 2550
        },
        "sizes": {
          "large": {
            "h": 2048,
            "w": 1583
          }
        },
        "type": "photo",
        "url": "https://t.co/o8GzEqqZEu"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "article": null
}