@PyTorch
A new #Triton BF16 Persistent Cache-Aware Grouped #GEMM kernel speeds up Mixture-of-Experts models like DeepSeek-V3, achieving up to 2.62x faster training on NVIDIA H100 GPUs compared to the #PyTorch loop baseline. Latest blog from @Meta & @IBM: https://t.co/7JueS7v6he https://t.co/7zTg2OAFki