@Birchlabs
measured xformers flash attention 2 against torch sdp attention. I'm not noticing any speedup, except a 2.6% one for batch-of-8. maybe it doesn't get to use the new optimization (parallelizing over sequence length) because of the large batch or num_heads? https://t.co/mIGPp1IM5p
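(The tweet doesn't show the benchmark harness. A minimal sketch of how such a comparison might be run is below; the shapes, dtype, and the assumption that xformers' `memory_efficient_attention` dispatches to its FlashAttention-2 kernel for these inputs are mine, not from the original measurement.)

```python
# Hypothetical benchmark: torch SDP attention vs. xformers memory-efficient attention.
# Shapes and dtype are assumptions; whether xformers picks its FlashAttention-2
# kernel depends on dispatch (it can also be forced via the `op=` argument).
import torch
import torch.nn.functional as F
import xformers.ops as xops

def bench(fn, *args, iters=100):
    # warm up, then time with CUDA events (ms per call)
    for _ in range(10):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# hypothetical shapes: batch-of-8, 8 heads, 4096 tokens, head dim 64
batch, heads, seq, dim = 8, 8, 4096, 64
q = torch.randn(batch, heads, seq, dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# torch SDP expects (batch, heads, seq, head_dim)
t_sdp = bench(F.scaled_dot_product_attention, q, k, v)

# xformers expects (batch, seq, heads, head_dim)
qx, kx, vx = (t.transpose(1, 2).contiguous() for t in (q, k, v))
t_xf = bench(xops.memory_efficient_attention, qx, kx, vx)

print(f"torch sdp: {t_sdp:.3f} ms/iter   xformers: {t_xf:.3f} ms/iter")
```

The speculation in the tweet is consistent with how FlashAttention-2's sequence-length parallelism works: it mainly helps when batch × num_heads is too small to fill the GPU's SMs on its own, so with a large batch and many heads there may be little headroom left for it to exploit.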