🐦 Twitter Post Details

@_jasonwei

Excited to open-source a new hallucinations eval called SimpleQA! For a while it felt like there was no great benchmark for factuality, and so we created an eval that was simple, reliable, and easy-to-use for researchers. Main features of SimpleQA:

1. Very simple setup: there are 4k diverse fact-seeking questions written by humans where there can only be a single, indisputable answer. Model completions are graded by an autograder as either correct, incorrect, or not attempted.

2. We created it so that it would be challenging for the current class of frontier models; both o1-preview and Claude Sonnet 3.5 are below 50% accuracy.

3. Reference answers have high correctness. Questions are written to be non-ambiguous and reference answers were verified by two independent annotators. Questions are also written to be timeless, so SimpleQA can be a useful benchmark even 5 or 10 years from now.

The way that I think about evals is that they are an incentive for the AI community. New benchmarks in AI get saturated very quickly, and what they incentivize gets encoded into the next generation of language models. With a good hallucinations eval, hopefully the next wave of language models will be more trustworthy and reliable!
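
The three-way grading mentioned in the post (correct, incorrect, not attempted) can be illustrated with a small tally. This is only a sketch, not the SimpleQA autograder: the function name `summarize_grades`, the label strings, and the "accuracy given attempted" figure are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label strings: the post names the three outcomes
# but does not say how they are encoded.
LABELS = ("correct", "incorrect", "not_attempted")


def summarize_grades(grades):
    """Tally three-way grades and derive two simple accuracy figures."""
    unknown = set(grades) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")

    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    return {
        "counts": {label: counts[label] for label in LABELS},
        # Share of all questions answered correctly.
        "accuracy": counts["correct"] / total if total else 0.0,
        # Share of attempted questions answered correctly (assumed metric).
        "accuracy_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
    }


example = summarize_grades(
    ["correct", "incorrect", "not_attempted", "correct"]
)
print(example["accuracy"])
```

Keeping "not attempted" separate from "incorrect" lets a model that declines to answer be distinguished from one that answers wrongly, which is the distinction a hallucinations eval cares about.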

🔧 Raw API Response

{
  "user": {
    "created_at": "2020-10-22T02:22:58.000Z",
    "default_profile_image": false,
    "description": "ai researcher @openai",
    "fast_followers_count": 0,
    "favourites_count": 7027,
    "followers_count": 72710,
    "friends_count": 559,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 1075,
    "location": "sf",
    "media_count": 116,
    "name": "Jason Wei",
    "normal_followers_count": 72710,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/1319101874532978690/1683100754",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1812243833205239808/hZnX6Q-a_normal.jpg",
    "screen_name": "_jasonwei",
    "statuses_count": 1165,
    "translator_type": "none",
    "url": "https://t.co/p6ZTRpKDOi",
    "verified": true,
    "withheld_in_countries": [],
    "id_str": "1319101874532978690"
  },
  "id": "1851681730845118799",
  "conversation_id": "1851681730845118799",
  "full_text": "Excited to open-source a new hallucinations eval called SimpleQA! For a while it felt like there was no great benchmark for factuality, and so we created an eval that was simple, reliable, and easy-to-use for researchers. Main features of SimpleQA:\n\n1. Very simple setup: there are 4k diverse fact-seeking questions written by humans where there can only be a single, indisputable answer. Model completions are graded by an autograder as either correct, incorrect, or not attempted.  \n\n2. We created it so that it would be challenging for the current class of frontier models; both o1-preview and Claude Sonnet 3.5 are below 50% accuracy.  \n\n3. Reference answers have high correctness. Questions are written to be non-ambiguous and reference answers were verified by two independent annotators. Questions are also written to be timeless, so SimpleQA can be a useful benchmark even 5 or 10 years from now.  \n\nThe way that I think about evals is that they are an incentive for the AI community. New benchmarks in AI get saturated very quickly, and what they incentivize gets encoded into the next generation of language models. With a good hallucinations eval, hopefully the next wave of language models will be more trustworthy and reliable!",
  "reply_count": 28,
  "retweet_count": 127,
  "favorite_count": 875,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/GbJ-jM7akAII589.jpg",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/_jasonwei/status/1851681730845118799",
  "created_at": "2024-10-30T17:45:01.000Z",
  "#sort_index": "1851681730845118799",
  "view_count": 101014,
  "quote_count": 11,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://x.com/_jasonwei/status/1851681730845118799"
}
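
A response shaped like the one above can be read with ordinary dictionary access. The sketch below is an illustration only: the helper name `summarize_post` is hypothetical, the `sample` dict is a hand-copied subset of the fields shown, and treating the sum of replies, retweets, favorites, and quotes as "interactions" is an assumption rather than anything the API defines.

```python
from datetime import datetime, timezone


def summarize_post(post):
    """Extract a few fields from a response dict shaped like the one above."""
    # Timestamps in the response look like "2024-10-30T17:45:01.000Z".
    created = datetime.strptime(post["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "author": post["user"]["screen_name"],
        "created_at": created.replace(tzinfo=timezone.utc),
        # Assumed definition: simple sum of the four engagement counters.
        "interactions": (
            post["reply_count"]
            + post["retweet_count"]
            + post["favorite_count"]
            + post["quote_count"]
        ),
        "has_photo": any(
            m.get("type") == "photo" for m in post.get("media", [])
        ),
    }


# Subset of the response above, copied by hand for the example.
sample = {
    "user": {"screen_name": "_jasonwei"},
    "created_at": "2024-10-30T17:45:01.000Z",
    "reply_count": 28,
    "retweet_count": 127,
    "favorite_count": 875,
    "quote_count": 11,
    "media": [{"type": "photo"}],
}

summary = summarize_post(sample)
print(summary["author"], summary["interactions"])
```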