@dair_ai
New research: FlashAttention-4

FlashAttention-4 co-designs the attention algorithm and kernel pipeline for Blackwell GPUs, where tensor core throughput doubles but memory bandwidth and exponential units scale more slowly. Key techniques: fully asynchronous MMA operations, software-emulated exponential rescaling, and leveraging tensor memory to reduce shared memory traffic.

Results: up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton on B200 GPUs with BF16, reaching 1613 TFLOPs/s at 71% utilization. Implemented entirely in Python via CuTe-DSL, with 20-30x faster compile times than C++ templates.

Paper: https://t.co/wBiS51m8Bm
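The "exponential rescaling" mentioned above refers to the online-softmax trick at the heart of all FlashAttention variants: attention is computed block by block, and when the running row maximum changes, the previous accumulator is rescaled by an exponential factor. A minimal NumPy sketch of that idea for a single query (this illustrates the general online-softmax algorithm, not FA-4's actual Blackwell kernel; `block` and the function name are illustrative):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=2):
    """Single-query attention computed block by block with online softmax.

    Illustrative sketch of the rescaling trick used by FlashAttention-style
    kernels; not the FA-4 implementation.
    """
    m = -np.inf                    # running max of scores
    l = 0.0                        # running softmax normalizer
    acc = np.zeros(V.shape[1])     # running weighted sum of values
    for start in range(0, K.shape[0], block):
        k = K[start:start + block]
        v = V[start:start + block]
        s = k @ q                  # attention scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # exponential rescaling of old state
        p = np.exp(s - m_new)      # unnormalized probabilities for this block
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l
```

The result matches a full-materialization softmax exactly, but the loop never holds more than one block of scores at a time, which is what lets the kernel keep the working set in fast on-chip memory.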