@PyTorch
One runtime, multiple GPU architectures, and zero vendor-specific model code. In this blog post, the TokenSpeed team @lightseekorg introduces TokenSpeed-Kernel, a portable, high-performance kernel system built for modern LLM inference. Using GPT-OSS 120B as a case study, they show how specialized kernels for @AIatAMD and @NVIDIAAI GPUs can coexist behind a common API. This unified approach delivers up to 3.6x higher throughput on the AMD MI355X without any changes to the model logic. Link to the blog post in the comments 👇