@sivil_taram
Why could a coding model trained on just 2.5T tokens compete with top-tier models like DeepSeekCoder (10T tokens) and QwenCoder (15T tokens)?

Curious about the answer? Check out our paper, OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models (https://t.co/Hh5otarsvx, https://t.co/mkEr6kBkjk), a new code language model with top-tier code generation performance and full openness! In this paper, we reveal the full details of our data cleaning, processing, and synthesis pipeline: insights that top labs often keep under wraps for code pre-training!

Here's what we offer:
- 1.5B & 8B code models supporting both English and Chinese
- Code to reproduce the 2.5T tokens of training data (coming soon!)
- 4.5M+ high-quality SFT examples

This work was led by the awesome @SimingHUAN38187, @crazycth0901, and @ziliwang8011184. Please find more details in this thread! 🧵