@kushal_tirumala
Excited to release our work on data selection for LLM pre-training! We introduce D4, a new data selection method for large-scale web data, which gets ~20% efficiency gains & +2% downstream acc @ 6.7B scale over the current standard of randomly sampling MinHash-deduped web docs https://t.co/imH9K5rSfx