🐦 Twitter Post Details

@iScienceLuvr

Benchmarking is Broken - Don't Let AI be its Own Judge

"In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench, a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress."

📊 Media Metadata

{
  "media": [
    {
      "type": "photo",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1976586775603851344/media_0.jpg?",
      "filename": "media_0.jpg"
    }
  ],
  "processed_at": "2025-10-12T13:37:22.856191",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "1976586775603851344",
  "url": "https://x.com/iScienceLuvr/status/1976586775603851344",
  "twitterUrl": "https://twitter.com/iScienceLuvr/status/1976586775603851344",
  "text": "Benchmarking is Broken - Don't Let AI be its Own Judge\n\n\"In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench, a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.\"",
  "source": "Twitter for iPhone",
  "retweetCount": 3,
  "replyCount": 1,
  "likeCount": 24,
  "quoteCount": 0,
  "viewCount": 4036,
  "createdAt": "Fri Oct 10 09:53:05 +0000 2025",
  "lang": "en",
  "bookmarkCount": 18,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "1976586775603851344",
  "displayTextRange": [
    0,
    276
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "iScienceLuvr",
    "url": "https://x.com/iScienceLuvr",
    "twitterUrl": "https://twitter.com/iScienceLuvr",
    "id": "441465751",
    "name": "Tanishq Mathew Abraham, Ph.D.",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1913710019729821696/Qge4zx6u_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/441465751/1738204246",
    "description": "",
    "location": "",
    "followers": 82015,
    "following": 1274,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Tue Dec 20 03:45:50 +0000 2011",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 107510,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 2536,
    "statusesCount": 18375,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1966202358519763196"
    ],
    "profile_bio": {
      "description": "CEO @SophontAI |\nFounder @MedARC_AI |\nPhD at 19 (2023) |\nex Research Director Stability AI | \nBiomed. engineer @ 14 |\nTEDx talk➡https://t.co/xPxwKTq6Qb",
      "entities": {
        "description": {
          "urls": [
            {
              "display_url": "bit.ly/3tpAuan",
              "expanded_url": "https://bit.ly/3tpAuan",
              "indices": [
                128,
                151
              ],
              "url": "https://t.co/xPxwKTq6Qb"
            }
          ],
          "user_mentions": [
            {
              "id_str": "0",
              "indices": [
                4,
                14
              ],
              "name": "",
              "screen_name": "SophontAI"
            },
            {
              "id_str": "0",
              "indices": [
                25,
                35
              ],
              "name": "",
              "screen_name": "MedARC_AI"
            }
          ]
        },
        "url": {
          "urls": [
            {
              "display_url": "sophont.med",
              "expanded_url": "https://sophont.med",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/MvROZZW1Zg"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "allow_download_status": {
          "allow_download": true
        },
        "display_url": "pic.twitter.com/e4rgAjE5fk",
        "expanded_url": "https://twitter.com/iScienceLuvr/status/1976586775603851344/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {},
          "orig": {}
        },
        "id_str": "1976586623300272128",
        "indices": [
          277,
          300
        ],
        "media_key": "3_1976586623300272128",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARtuPw+9mgAACgACG24/MzOaMFAAAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABG24/D72aAAAKAAIbbj8zM5owUAAA",
            "media_key": "3_1976586623300272128"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/G24_D72aAAA5_mz.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 743,
              "w": 1326,
              "x": 0,
              "y": 0
            },
            {
              "h": 1326,
              "w": 1326,
              "x": 0,
              "y": 0
            },
            {
              "h": 1512,
              "w": 1326,
              "x": 0,
              "y": 0
            },
            {
              "h": 1739,
              "w": 870,
              "x": 228,
              "y": 0
            },
            {
              "h": 1739,
              "w": 1326,
              "x": 0,
              "y": 0
            }
          ],
          "height": 1739,
          "width": 1326
        },
        "sizes": {
          "large": {
            "h": 1739,
            "w": 1326
          }
        },
        "type": "photo",
        "url": "https://t.co/e4rgAjE5fk"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {},
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "article": null
}