@jclin808
📉 SFT might not suffer as much catastrophic forgetting as you think. There's been a lot of debate around GRPO in the community lately. RL is hot, but let's not forget: in the context of LLMs, SFT is the bedrock of almost all RL. And there's still a lot we don't fully understand about SFT.

Paper link: https://t.co/iawopsRn7b

🤔 We revisit domain-specific SFT and find that even a small learning rate alone achieves a sweet trade-off: (1) general-purpose degradation is largely mitigated; (2) target-domain performance stays nearly as strong as with a larger learning rate.

Motivated by both theory and experiments, we then propose TALR (Token-Adaptive Loss Reweighting), a method that further alleviates forgetting and achieves favorable trade-offs.

#GRPO #LLM #Amazon #Claude #DeepSeek #GLM
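For intuition only, here's a minimal PyTorch sketch of what token-adaptive loss reweighting could look like. The specific weighting scheme here (scaling each token's cross-entropy by the model's own label probability, with a temperature `tau`) is my assumption for illustration, not TALR's actual formulation; see the paper for the real method.

```python
import torch
import torch.nn.functional as F

def token_adaptive_loss(logits, labels, tau=1.0, ignore_index=-100):
    """Hypothetical sketch of a token-adaptive reweighted SFT loss.

    Assumption (not from the post): each token's cross-entropy is
    down-weighted when the model assigns low probability to the label,
    so high-loss tokens, which push hardest against the pretrained
    distribution, contribute less to the update.

    logits: (batch, seq_len, vocab), labels: (batch, seq_len)
    """
    # Per-token cross-entropy; ignored positions get loss 0.
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        reduction="none",
        ignore_index=ignore_index,
    )
    mask = (labels.view(-1) != ignore_index).float()

    # Weights are computed without gradient so only the CE term trains.
    with torch.no_grad():
        p_label = torch.exp(-per_token)       # p(label) = exp(-CE)
        weights = p_label.pow(1.0 / tau)      # tau controls sharpness

    return (weights * per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

With `tau -> infinity` all weights go to 1 and this reduces to plain SFT cross-entropy, so this family of losses interpolates between standard fine-tuning and a more conservative update.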