@rasbt
@anupbhat30 You can tune hparams such that GQA and MLA have roughly the same KV cache size for each model size, but yeah, the question is which one has the better modeling performance at the same size. I think the jury is still out, although rumor has it that MLA doesn't do that well at small sizes. Unfortunately, there's no ablation study across sizes to say anything more concrete.
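As a rough sketch of what "tuning hparams so the KV cache sizes match" could look like: GQA caches K and V for each KV head (2 · n_kv_heads · head_dim per token per layer), while MLA caches a compressed latent (plus a decoupled RoPE key, as in DeepSeek-V2). The specific numbers below are hypothetical, not from any released config:

```python
def gqa_kv_cache_per_token(n_layers, n_kv_heads, head_dim):
    # GQA stores K and V for each KV head: 2 * n_kv_heads * head_dim per layer
    return n_layers * 2 * n_kv_heads * head_dim

def mla_kv_cache_per_token(n_layers, kv_latent_dim, rope_head_dim):
    # MLA stores one compressed KV latent plus a decoupled RoPE key per layer
    return n_layers * (kv_latent_dim + rope_head_dim)

# Hypothetical small-model settings
n_layers, head_dim = 24, 64
mla = mla_kv_cache_per_token(n_layers, kv_latent_dim=256, rope_head_dim=32)

# Pick the GQA KV-head count whose cache size comes closest to MLA's
n_kv_heads = round(mla / (n_layers * 2 * head_dim))
gqa = gqa_kv_cache_per_token(n_layers, n_kv_heads, head_dim)
print(n_kv_heads, gqa, mla)  # values per token, in elements (multiply by dtype bytes)
```

With these numbers the two caches land in the same ballpark, which is the point: cache size can be equalized by construction, so any remaining gap between the two is down to modeling performance.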