@chenru_duan
Are LLMs ready for real scientific discovery? To find out, we gathered 50+ scientists from 20+ institutions to establish a multi-level evaluation framework: not only questions, but also research scenarios and full projects.

Current science benchmarks (like GPQA and MMMU) ask AI to answer quizzes. But science isn't a quiz. It's an iterative loop of hypothesis, experiment, and analysis. Mastery of static, decontextualized questions, even if perfect, does not guarantee readiness for discovery, just as earning straight A's in coursework does not make a great researcher.

Today, we introduce Scientific Discovery Evaluation (SDE): a benchmark grounded in real-world research projects. Each project is decomposed into modular research scenarios, from which vetted questions are sampled. LLMs are evaluated at two levels:

1. Question level: targeted, expert-written problems embedded in real research scenarios (structure elucidation from NMR, forward reaction prediction, etc.), NOT sub-domains (analytical chemistry, inorganic materials, etc.)
2. Project level: realistic scientific discovery loops (e.g., molecular design, materials discovery, protein engineering) in which models must iteratively propose, test, and refine hypotheses.

With the joint effort of 50+ scientists from 20+ institutes, we gathered 8 projects, 43 research scenarios, and 1,125 questions. Evaluating at these multiple levels reveals where current models succeed, where they fail, and why.

It has been a great joy to work with a 50+ author team for the first time in my life. Thanks to you all for making it happen: @hello_jocelynlu, @YuanqiD, @BotaoYu24, @HowieH36226, @rogerluorl18, @YuanhaoQ, @YinkaiW, @Haorui_Wang123, @JeffGuo__, @SherryLixueC, @MengdiWang10, @lecong, @ParshinShojaee, @KexinHuang5, @chandankreddy, @realadityanandy, @pschwllr, @KulikGroup, @hhsun1, @MoosaviSMohamad, and many others who are not on X.

It's also exciting to see a concurrent release from @OpenAI on FrontierScience yesterday (@MilesKWang)! Their findings on the need for harder, expert-vetted evals, especially the large performance gap between Olympiad and research questions, echo ours. SDE takes this a step further by moving beyond expert-level Q&A to explicitly evaluate the end-to-end discovery loop through project-level execution, which makes finer-grained observations possible.

Core Findings Below: