@_philschmid
Can we scale synthetic data to a pretraining level? 🤔 Yes, we can‼️ Cosmopedia just released the largest open synthetic dataset with 25B tokens across textbooks, blog posts, and more, generated by Mixtral-8x7B-Instruct-v0.1 using ~16,000 H100 GPU hours.

Approach:
1️⃣ Collected unsupervised seed data (web pages, educational sources, existing datasets)
2️⃣ Created a diverse set of prompts that rephrase or generate new content from the original material, e.g.
- Write an educational story (3-5 paragraphs) targeted at young children
- Write a long and very detailed tutorial based on the website
3️⃣ Used llm-swarm and Mixtral to generate the synthetic data, resulting in less than 1% duplicates (see the sketch below)

💡 Cosmopedia isn't created from thin air; it takes existing lower-quality data and rephrases it into high-quality, textbook-like content using LLMs (which can include hallucinations).
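For intuition, here's a minimal sketch of the rephrase-and-generate step, assuming huggingface_hub's InferenceClient for inference (the real pipeline used llm-swarm for throughput at scale). The templates and seed text are illustrative assumptions, not the dataset's exact prompts:

```python
# Minimal sketch of the rephrase-and-generate idea (NOT the actual Cosmopedia
# pipeline): templates and seed text below are illustrative assumptions.
from huggingface_hub import InferenceClient

client = InferenceClient(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# Prompt templates that turn lower-quality seed text into targeted content.
TEMPLATES = [
    "Write an educational story (3-5 paragraphs) targeted at young children "
    "based on the following extract:\n{seed}",
    "Write a long and very detailed tutorial based on the following website "
    "extract:\n{seed}",
]

def generate_synthetic(seed_text: str, template: str) -> str:
    # Mixtral-Instruct expects the [INST] ... [/INST] chat format.
    prompt = f"[INST] {template.format(seed=seed_text)} [/INST]"
    return client.text_generation(prompt, max_new_tokens=1024, temperature=0.8)

# Hypothetical seed extract, standing in for scraped web/education text.
seed = "Photosynthesis is the process by which plants convert sunlight..."
print(generate_synthetic(seed, TEMPLATES[0]))
```

Varying the template (audience, format, length) per seed is what drives the diversity that keeps generated duplicates under 1%.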