@OpenBMB
🔥 Ultra-FineWeb-en-v1.4 is coming! 2.2T tokens, fully open-sourced!

The core training fuel for MiniCPM4 / 4.1, fully updated based on FineWeb v1.4.0.

What's New
1️⃣ Fresher Data: Added CommonCrawl snapshots from Apr 2024 - Jun 2025 to capture the latest world knowledge.
2️⃣ Easier Access: CC Dump Slices are here! No need to download the entire massive dataset anymore; fetch exactly the slices you need.

⚡ Highlights & Performance
- Efficient Verification Strategy: Reduces data verification cost by 90%
- High-Efficiency Filtering Pipeline: Optimizes selection of both positive and negative samples
- Performance Gains: +3.613/+1.331 (English) & +1.98/+0.61 (Chinese) vs. FineWeb/FineWeb-edu & Chinese FineWeb-edu-v2

Still high-quality cleaning. Still true to the open-source spirit. Welcome to download and test!

Resources
Dataset: https://t.co/KluL5t2kUn
Paper: https://t.co/Kg9LLUqZgB
🧩 Classifier: https://t.co/oUfxrN6AmP
MiniCPM4: https://t.co/IQ82jD1PTi

#UltraFineWeb #MiniCPM4 #AI #LLM #OpenBMB #UltraData