@LoubnaBenAllal1
🌌 Cosmopedia: the largest open synthetic dataset of textbooks, blog posts and stories, generated by Mixtral, with a total of 25B tokens and 30M files 🚀 https://t.co/qtM06YXex7

A little backstory to this "cosmic" journey: Two weeks ago I started experimenting with some cool web clustering from @lvwerra and synthetic data prompts by @gui_penedo and @Thom_Wolf. I was incredibly impressed by the quality and diversity of the generations when using Mixtral-8x7B-Instruct-v0.1.

Then 600 H100 GPUs became free on the HF cluster for a night! 💡 Given we had the llm-swarm library by @vwxyzjn, which scales efficiently for data generation, @anton_lozhkov and I scraped some data and launched the pipeline at full capacity, resulting in 25 billion tokens of textbooks, blog posts, and stories with GPT-3.5 quality, all under an Apache 2.0 license. This makes Cosmopedia the largest synthetic dataset available 🚀.

We also trained a Phi-like model on it, cosmo-1b, to test the quality of the dataset: https://t.co/CedBBp69WP It's comparable to other 1B models on a couple of evals :)

We're sharing it all with you: the dataset, prompts, and end-to-end pipeline https://t.co/91UrNuM0it

This is version 0.1 of the dataset, with significant room for improvement: additional generation styles, more languages, better coverage of scientific topics, and even better models! Super excited about what the community will build on top of it 🚀. Enjoy! ✨

Bonus: distribution charts! 📊 We build each prompt from a sample in a seed dataset (e.g. the web or Stanford courses) and ask Mixtral to generate a specific format (e.g. a textbook or a story) targeting a specific audience.
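For anyone curious how that prompt building looks in practice, here's a minimal Python sketch of the idea: take a seed sample, pick a format and an audience, and fill a template. The seed fields, template wording, formats, and audiences below are illustrative assumptions, not the exact prompts shipped with Cosmopedia.

```python
# Minimal sketch of the prompt-building idea described above.
# The seed sample, template wording, formats, and audiences are
# illustrative assumptions, not the exact Cosmopedia prompts.

SEED_SAMPLE = {
    "text": "Photosynthesis converts light energy into chemical energy ...",
    "source": "web",
}

PROMPT_TEMPLATE = (
    'Here is an extract from a {source} sample: "{extract}"\n\n'
    "Write a long and detailed {fmt} for {audience}, related to the topic "
    "of the extract above. Do not just summarize the extract."
)

def build_prompt(sample: dict, fmt: str = "textbook chapter",
                 audience: str = "college students") -> str:
    """Turn one seed sample into a generation prompt for Mixtral."""
    return PROMPT_TEMPLATE.format(
        source=sample["source"],
        extract=sample["text"][:1000],  # keep the seed excerpt short
        fmt=fmt,
        audience=audience,
    )

print(build_prompt(SEED_SAMPLE, fmt="story", audience="young children"))
```

Varying the format and audience for the same seed is what keeps generations on the same topic from all looking alike, which is where a lot of the dataset's diversity comes from.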