@dmsobol
After part 3 of the MoE 101 series we got two main questions: 1. why is the MoE forward pass slower than a dense network? 2. why do I hit OOM when I try to train 64 experts on a single GPU? We discuss both problems and their solutions in part 4: https://t.co/uW6H78ZE56 1/n 🧵
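A rough back-of-envelope sketch of the memory arithmetic behind question 2. All sizes here are hypothetical picks for illustration, not figures from the article:

```python
# Hypothetical FFN expert sizes (d_model, d_ff chosen for illustration only).
d_model, d_ff, num_experts = 4096, 14336, 64

# Each FFN expert has an up-projection and a down-projection weight matrix.
params_per_expert = 2 * d_model * d_ff
total_expert_params = num_experts * params_per_expert

# Training in fp32 with Adam: weights + gradients + two optimizer moments
# is roughly 16 bytes per parameter.
bytes_per_param = 16
gib = total_expert_params * bytes_per_param / 2**30
print(f"{total_expert_params / 1e9:.1f}B expert params "
      f"-> ~{gib:.0f} GiB for a single MoE layer")
# ~7.5B params -> ~112 GiB: past an 80 GiB GPU before counting activations,
# which is why the experts end up sharded across devices.
```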