@AlphaSignalAI
AI2 just released the largest open-source dataset for LLM pretraining: 3 trillion tokens of high-quality data.
- Web data from Common Crawl
- Quality filtering
- Deduplication within each source
- Risk mitigation for harmful content
https://t.co/vHu07YA5GT https://t.co/W9OHrL6QRj