@rasbt
There's currently a lot of talk about Mistral, but have you seen the new QA-LoRA paper?

- LoRA (low-rank adaptation) is awesome because it fine-tunes a base LLM by training only a small set of low-rank adapter weights.
- QLoRA is awesome because it lowers memory requirements even further by quantizing the base model weights.
- QA-LoRA is even more awesome: it takes QLoRA a step further and also quantizes the LoRA (adapter) weights, avoiding the costly conversion of the quantized base model weights back into 16-bit when the adapter weights are added.

This concept is summarized in the annotated figure below.

A little nitpick: Table 2 shows that QA-LoRA is about 2x faster than QLoRA for fine-tuning. However, QA-LoRA used far fewer adapter parameters, so I think a fairer speed comparison would use the same number of adapter parameters for both methods.
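
For intuition, here is a minimal PyTorch sketch of the plain-LoRA idea (not the paper's implementation; class name and hyperparameters are illustrative): the base weight stays frozen (and, in QLoRA/QA-LoRA, is additionally stored in quantized form), while only the small low-rank matrices A and B are trained. QA-LoRA's extra step, group-wise quantization so the adapter can be folded back into the quantized base weights without a 16-bit detour, is not shown here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight plus a trainable low-rank update B @ A."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False           # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: update starts at 0
        self.scaling = alpha / rank

    def forward(self, x):
        # y = x W^T + scaling * x A^T B^T  -- only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(768, 768, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 12,288 vs. 589,824 in the frozen base weight
```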