@dair_ai
Highly-recommended read from MIT on the part of RL with verifiable rewards that everyone keeps hitting. RLVR only optimizes what you can objectively score, so style, structure, and diversity quietly collapse and reward hacking creeps in. The fix here adds an adversarial discriminator trained on human demonstrations, which acts as a learned proxy for the human output distribution. The generator maximizes both task accuracy and the discriminator's human-likeness signal, so verifiable rewards and imitation of humans get optimized together. Why does it matter? Across bug fixing, story generation, and a reward-hacking benchmark, this preserves RLVR's accuracy gains while restoring the fuzzy properties it usually destroys. Bug fixes come out with much lower edit distance, stories score higher win rates and stay diverse, and misbehavior nearly disappears. Paper: https://t.co/kBZA66WGyC Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c