@LoubnaBenAllal1
🌌 Cosmopedia: the largest open synthetic dataset of textbooks, blog posts and stories, generated by Mixtral, with a total of 25B tokens and 30M files 🚀 https://t.co/qtM06YXex7

A little backstory to this "cosmic" journey: Two weeks ago I started experimenting with some cool web clustering from @lvwerra and synthetic data prompts by @gui_penedo and @Thom_Wolf. I was incredibly impressed by the quality and diversity of the generations when using Mixtral-8x7B-Instruct-v0.1.

Then 600 H100 GPUs became free on the HF cluster for a night! 💡 Given we had the llm-swarm library by @vwxyzjn, which scales efficiently for data generation, @anton_lozhkov and I scraped some data and launched the pipeline at full capacity, resulting in 25 billion tokens of textbooks, blog posts, and stories with GPT-3.5 quality, all under an Apache 2.0 license. This makes Cosmopedia the largest synthetic dataset available 🚀.

We also trained a Phi-like model on it, cosmo-1b, to test the quality of the dataset: https://t.co/CedBBp69WP It's comparable to other 1B models on a couple of evals :)

We're sharing it all with you: the dataset, prompts, and end-to-end pipeline https://t.co/91UrNuM0it

This is version 0.1 of the dataset, with significant room for improvement: additional generation styles, more languages, better coverage of scientific topics, and even better models! Super excited about what the community will build on top of it 🚀. Enjoy! ✨

Bonus: distribution charts! 📊 We build each prompt from a sample in a seed dataset (e.g. the web or Stanford courses) and ask Mixtral to generate a specific format (e.g. a textbook or a story) targeting a specific audience.
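For anyone curious how that prompt building looks in practice, here's a minimal Python sketch of the idea: take a seed sample, pick a format and an audience, and fill a template. The seed fields, template wording, formats, and audiences below are illustrative assumptions, not the exact prompts shipped with Cosmopedia.

```python
# Minimal sketch of the prompt-building idea described above.
# The seed sample, template wording, formats, and audiences are
# illustrative assumptions, not the exact Cosmopedia prompts.

SEED_SAMPLE = {
    "text": "Photosynthesis converts light energy into chemical energy ...",
    "source": "web",
}

PROMPT_TEMPLATE = (
    'Here is an extract from a {source} sample: "{extract}"\n\n'
    "Write a long and detailed {fmt} for {audience}, related to the topic "
    "of the extract above. Do not just summarize the extract."
)

def build_prompt(sample: dict, fmt: str = "textbook chapter",
                 audience: str = "college students") -> str:
    """Turn one seed sample into a generation prompt for Mixtral."""
    return PROMPT_TEMPLATE.format(
        source=sample["source"],
        extract=sample["text"][:1000],  # keep the seed excerpt short
        fmt=fmt,
        audience=audience,
    )

print(build_prompt(SEED_SAMPLE, fmt="story", audience="young children"))
```

Varying the format and audience for the same seed is what keeps generations on the same topic from all looking alike, which is where a lot of the dataset's diversity comes from.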