@anneouyang
Excited to share what friends and I have been working on at @Standard_Kernel. We've raised from General Catalyst (@generalcatalyst), Felicis (@felicis), and a group of exceptional angels.

We have some great H100 BF16 kernels in pure CUDA+PTX, featuring:
- Matmul at 102%-105% of cuBLAS performance in 100 lines of code
- Attention at 104% of FlashAttention3 performance in 500 lines
- Fused Llama3 FFN at 120% of PyTorch (gpt-fast) performance

Reach out if you want to work on AI kernel gen with us!
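For context on what "matmul in pure CUDA" means at its simplest, here is a minimal shared-memory-tiled BF16 matmul sketch. This is purely illustrative and is not the Standard Kernel implementation: competitive H100 kernels use tensor-core MMA via PTX, async copies/TMA, and warp specialization, none of which appears below. Assumes row-major inputs with M, N, K divisible by TILE.

```cuda
#include <cuda_bf16.h>

#define TILE 32

// Naive tiled matmul: C (FP32) = A (BF16) * B (BF16).
// Illustrative sketch only; assumes M, N, K are multiples of TILE.
__global__ void matmul_bf16_tiled(const __nv_bfloat16* A,
                                  const __nv_bfloat16* B,
                                  float* C, int M, int N, int K) {
    __shared__ __nv_bfloat16 As[TILE][TILE];
    __shared__ __nv_bfloat16 Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // accumulate in FP32 for precision

    for (int t = 0; t < K; t += TILE) {
        // Each thread cooperatively loads one element of the A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Inner product over the shared-memory tile.
        for (int k = 0; k < TILE; ++k)
            acc += __bfloat162float(As[threadIdx.y][k]) *
                   __bfloat162float(Bs[k][threadIdx.x]);
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

A kernel like this runs at a small fraction of cuBLAS throughput; closing (and beating) that gap with tensor cores and careful scheduling is exactly what makes the 100-line figure notable.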