@dair_ai
Why do RL runs on LLMs blow up even when the recipe looks right?

GEOALIGN, from the Alibaba team behind Qwen, points at the rollouts. A handful of bad batches push the policy in incoherent directions, and most stability tuning just damps the symptom.

This work curates rollouts by their geometry, removing the samples that make update directions conflict before they destabilize training.

Why does it matter?

If instability is largely a bad-batch problem, rollout curation is a lower-effort lever than another round of KL or clip tuning. You fix the data going into the update rather than fighting the optimizer.

Paper: https://t.co/tUAYC57MVy

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c