@Ofirlin
Can we have an optimizer as fast as Muon but with a smaller memory footprint? In our recent NeurIPS paper, we show it's possible and introduce SUMO🎉 Muon's speed comes from fast momentum orthogonalization via Newton-Schulz (NS) iteration. But the NS approximation breaks down when gradients are projected into a low-dimensional subspace. SUMO's fix: exact SVD orthogonalization inside a low-rank subspace, giving Muon-level geometry awareness at a fraction of the memory cost (1/5)
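A minimal sketch of the idea in NumPy — project the momentum into a low-rank subspace and orthogonalize it exactly with an SVD instead of Newton-Schulz. Function and variable names here are ours for illustration, not the paper's actual API, and the projection/momentum details are assumptions:

```python
import numpy as np

def svd_orthogonalize(M):
    # Exact orthogonalization: replace M with its polar factor U V^T,
    # computed from the (thin) SVD M = U S V^T.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def sumo_like_step(grad, P, momentum, beta=0.95):
    # Hypothetical single step (our sketch, not the paper's code):
    # 1) project the full gradient (d x n) into an r-dim subspace via P (d x r)
    # 2) accumulate momentum in that low-rank subspace (r x n)
    # 3) orthogonalize exactly with SVD — no Newton-Schulz approximation
    # 4) map the orthogonalized update back to the full space
    low_rank_grad = P.T @ grad
    momentum = beta * momentum + low_rank_grad
    update = svd_orthogonalize(momentum)
    return P @ update, momentum

# Only the r x n momentum buffer is stored, not a d x n one —
# that is where the memory saving over full-space Muon comes from.
```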