@arankomatsuzaki
Reinforcement Learning for Reasoning in Large Language Models with One Training Example

- 36.0% -> 73.6% on MATH500 by performing RLVR on a single example
- Applying entropy loss alone, without any outcome reward, improves perf by 27.4%

https://t.co/09jHOmxBTh