@PyTorch
PyTorch and Nebius have collaborated to speed up pre-training DeepSeek-V3 (16B & 671B) on 256 NVIDIA B200 GPUs using TorchTitan. Combining MXFP8 via TorchAO with DeepEP yielded up to 41% faster training throughput over BF16 (with equivalent convergence). Full results and reproducible configs in the blog: https://t.co/udvt6W86FM

Alireza Shamsoshoara, Matthias Reso, Hamid Shojanazeri, @vega_myhre, Hooman Ramezani

#PyTorch #OpenSourceAI #TorchAO #MXFP8
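For context, a minimal sketch of what enabling MXFP8 training through TorchAO might look like. The blog's TorchTitan configs are the canonical reproduction path; the module path, the `MXLinearConfig` fields, and the `quantize_` usage below are assumptions based on TorchAO's prototype `mx_formats` API, which may change between releases.

```python
# Sketch: swap a model's linear layers to MXFP8 via TorchAO's prototype
# MX support. Assumes a Blackwell-class GPU (e.g. B200) with cuBLAS MX gemms.
import torch
import torch.nn as nn
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import MXLinearConfig, MXGemmKernelChoice

# Toy stand-in for a transformer block's linear layers.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
).cuda().to(torch.bfloat16)

# MXFP8: fp8 (e4m3) elements sharing one scale per 32-element block.
# Field names here are assumptions from the prototype API.
config = MXLinearConfig(
    elem_dtype=torch.float8_e4m3fn,
    block_size=32,
    gemm_kernel_choice=MXGemmKernelChoice.CUBLAS,
)

# Replace eligible nn.Linear modules in place with MX-quantized versions.
quantize_(model, config=config)

# Training then proceeds as usual; only the linear matmuls run in MXFP8.
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
```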