@_philschmid
Can we scale synthetic data to a pretraining level? 🤔 Yes, we can‼️ Cosmopedia just released the largest open synthetic dataset with 25B tokens across textbooks, blog posts, and more, generated by Mixtral-8x7B-Instruct-v0.1 using ~16,000 H100 GPU hours.

Approach:
1️⃣ Collected unsupervised seed data (web pages, educational sources, existing datasets)
2️⃣ Created a diverse set of prompts that rephrase or generate new content from the original material, e.g.
- Write an educational story (3-5 paragraphs) targeted at young children
- Write a long and very detailed tutorial based on the website
3️⃣ Used llm-swarm and Mixtral to generate the synthetic data, resulting in less than 1% duplicates (see the sketch below)

💡 Cosmopedia isn't created from thin air; it takes existing lower-quality data and rephrases it into high-quality, textbook-like content using LLMs (which can include hallucinations).
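For intuition, here's a minimal sketch of the rephrase-and-generate step, assuming huggingface_hub's InferenceClient for inference (the real pipeline used llm-swarm for throughput at scale). The templates and seed text are illustrative assumptions, not the dataset's exact prompts:

```python
# Minimal sketch of the rephrase-and-generate idea (NOT the actual Cosmopedia
# pipeline): templates and seed text below are illustrative assumptions.
from huggingface_hub import InferenceClient

client = InferenceClient(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# Prompt templates that turn lower-quality seed text into targeted content.
TEMPLATES = [
    "Write an educational story (3-5 paragraphs) targeted at young children "
    "based on the following extract:\n{seed}",
    "Write a long and very detailed tutorial based on the following website "
    "extract:\n{seed}",
]

def generate_synthetic(seed_text: str, template: str) -> str:
    # Mixtral-Instruct expects the [INST] ... [/INST] chat format.
    prompt = f"[INST] {template.format(seed=seed_text)} [/INST]"
    return client.text_generation(prompt, max_new_tokens=1024, temperature=0.8)

# Hypothetical seed extract, standing in for scraped web/education text.
seed = "Photosynthesis is the process by which plants convert sunlight..."
print(generate_synthetic(seed, TEMPLATES[0]))
```

Varying the template (audience, format, length) per seed is what drives the diversity that keeps generated duplicates under 1%.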