🐦 Twitter Post Details

Viewing enriched Twitter post

@karpathy

"My benchmark for large language models" https://t.co/YZBuwpL0tl Nice post but even more than the 100 tests specifically, the Github code looks excellent - full-featured test evaluation framework, easy to extend with further tests and run against many LLMs. https://t.co/KnmDD1AJci E.g. for the 100 current tests on 7 models: - GPT-4: 49% passed - GPT-3.5: 30% passed - Claude 2.1: 31% passed - Claude Instant 1.2: 23% passed - Mistral Medium: 25% passed - Mistral Small 21% passed - Gemini Pro: 21% passed Also a huge fan of the idea of mining tests from actual use cases in the chat history. I think people would be surprised how odd and artificial many "standard" LLM eval benchmarks can be. Now... how can a community collaborate on more of these benchmarks... 🤔

Media 1

📊 Media Metadata

{
  "data": [
    {
      "id": "",
      "type": "photo",
      "url": null,
      "media_url": "https://pbs.twimg.com/media/GGzaH1FasAA0_9a.jpg",
      "media_url_https": null,
      "display_url": null,
      "expanded_url": null
    }
  ],
  "score": 1.0,
  "scored_at": "2025-08-09T13:46:07.550016",
  "import_source": "manual_curation_2024",
  "media": [
    {
      "type": "photo",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1760022429605474550/media_0.jpg?",
      "filename": "media_0.jpg",
      "original_url": "https://pbs.twimg.com/media/GGzaH1FasAA0_9a.jpg"
    }
  ],
  "storage_migrated": true
}

🔧 Raw API Response

{
  "user": {
    "created_at": "2009-04-21T06:49:15.000Z",
    "default_profile_image": false,
    "description": "🧑‍🍳. Previously Director of AI @ Tesla, founding team @ OpenAI, CS231n/PhD @ Stanford. I like to train large deep neural nets 🧠🤖💥",
    "fast_followers_count": 0,
    "favourites_count": 10640,
    "followers_count": 941229,
    "friends_count": 893,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 12301,
    "location": "Stanford",
    "media_count": 666,
    "name": "Andrej Karpathy",
    "normal_followers_count": 941229,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/33836629/1407117611",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1296667294148382721/9Pr6XrPB_normal.jpg",
    "screen_name": "karpathy",
    "statuses_count": 8590,
    "translator_type": "none",
    "url": "https://t.co/0EcFthjJXM",
    "verified": true,
    "withheld_in_countries": [],
    "id_str": "33836629"
  },
  "id": "1760022429605474550",
  "conversation_id": "1760022429605474550",
  "full_text": "\"My benchmark for large language models\"\nhttps://t.co/YZBuwpL0tl\n\nNice post but even more than the 100 tests specifically, the Github code looks excellent - full-featured test evaluation framework, easy to extend with further tests and run against many LLMs.\nhttps://t.co/KnmDD1AJci\n\nE.g. for the 100 current tests on 7 models:\n- GPT-4: 49% passed\n- GPT-3.5: 30% passed\n- Claude 2.1: 31% passed\n- Claude Instant 1.2: 23% passed\n- Mistral Medium: 25% passed\n- Mistral Small 21% passed\n- Gemini Pro: 21% passed\n\nAlso a huge fan of the idea of mining tests from actual use cases in the chat history. I think people would be surprised how odd and artificial many \"standard\" LLM eval benchmarks can be. Now... how can a community collaborate on more of these benchmarks... 🤔",
  "reply_count": 138,
  "retweet_count": 493,
  "favorite_count": 3987,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [
    {
      "url": "https://t.co/Vr8zikr4bu",
      "expanded_url": "https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html",
      "display_url": "nicholas.carlini.com/writing/2024/m…"
    }
  ],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/GGzaH1FasAA0_9a.jpg",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/karpathy/status/1760022429605474550",
  "created_at": "2024-02-20T19:23:41.000Z",
  "#sort_index": "1760022429605474550",
  "view_count": 377464,
  "quote_count": 23,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://twitter.com/karpathy/status/1760022429605474550"
}
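The per-model pass rates quoted in the tweet exist only as free text inside the `full_text` field of the raw API response. A minimal sketch of recovering them as structured data — using a hand-trimmed excerpt of the response above rather than a live API call, and a regex whose colon is optional because one line ("Mistral Small 21% passed") omits it:

```python
import re

# Trimmed excerpt of the "Raw API Response" above; only the fields used here.
raw = {
    "user": {"screen_name": "karpathy"},
    "full_text": (
        "E.g. for the 100 current tests on 7 models:\n"
        "- GPT-4: 49% passed\n"
        "- GPT-3.5: 30% passed\n"
        "- Claude 2.1: 31% passed\n"
        "- Claude Instant 1.2: 23% passed\n"
        "- Mistral Medium: 25% passed\n"
        "- Mistral Small 21% passed\n"
        "- Gemini Pro: 21% passed"
    ),
    "favorite_count": 3987,
}

# Match lines of the form "- <model>: <pct>% passed"; the colon is optional.
pattern = re.compile(r"^- (.+?):? (\d+)% passed$", re.MULTILINE)
scores = {model: int(pct) for model, pct in pattern.findall(raw["full_text"])}

print(scores)                        # {'GPT-4': 49, 'GPT-3.5': 30, ...}
print(len(scores))                   # 7
print(max(scores, key=scores.get))   # GPT-4
```

The lazy `(.+?)` lets model names containing digits and dots ("Claude Instant 1.2", "GPT-3.5") parse correctly, since the percentage is anchored by the trailing `% passed`.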