@emollick
GDPval is one of the most important benchmarks of AI ability because it is grounded in human expertise. It compares expert human performance to AI performance using expert human judges who spend an average of an hour evaluating each answer, and it includes holdout questions that are not public. It is also very expensive to run, and OpenAI controls it, so I understand the need for alternatives. But GDPval-AA is not a good alternative. GDPval-AA has AI models respond to the public dataset of GDPval questions, then asks Gemini 3.1 to judge which of two LLM answers is better. There is no reason to suspect that this is highly correlated with what an expert human would think, nor is there a human baseline to compare against.