@omarsar0
NEW research from NVIDIA.

Post-training agents with RL is powerful but expensive: every parameter update requires full multi-turn rollouts with environment interactions, making end-to-end RL prohibitively costly for long-horizon agentic tasks. This work offers a practical middle ground.

The paper introduces PivotRL, a framework that operates on existing SFT trajectories to combine the computational efficiency of SFT with the out-of-domain retention of end-to-end RL. Instead of exhaustive full-trajectory rollouts, PivotRL identifies pivots: informative intermediate turns where sampled actions show mixed outcomes. It then trains only on those high-signal moments.

Standard SFT degrades out-of-domain (OOD) performance by 9.83 points on average; PivotRL stays near zero (+0.21) while achieving a +14.11 average in-domain gain over the base model, versus +9.94 for SFT. On SWE-Bench, PivotRL reaches accuracy competitive with end-to-end RL while using 4x fewer rollout turns and 5.5x less wall-clock time.

The method is already deployed in production as the workhorse for NVIDIA's Nemotron-3-Super-120B agentic post-training.

Paper: https://t.co/sIBLUpyfMD

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
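The pivot-selection idea can be sketched in a few lines. This is a toy illustration under my own assumptions, not the paper's actual API: the real method samples actions from the model and rolls them out in an environment, whereas here `rollout_outcome` is a deterministic stand-in, and all names are hypothetical.

```python
import random

def rollout_outcome(turn_index: int, action: str) -> bool:
    """Toy stand-in for rolling a sampled action out to a terminal
    success (True) or failure (False). Purely illustrative."""
    rng = random.Random(f"{turn_index}-{action}")  # deterministic toy outcome
    return rng.random() < 0.5

def find_pivots(trajectory, num_samples=4):
    """Return indices of turns whose sampled actions show mixed
    outcomes, i.e. the high-signal moments to train on."""
    pivots = []
    for t in range(len(trajectory)):
        # Sample several candidate actions at this turn (hypothetical names).
        outcomes = [rollout_outcome(t, f"sample_{i}")
                    for i in range(num_samples)]
        # Mixed outcomes => this turn discriminates success from failure.
        if 0 < sum(outcomes) < num_samples:
            pivots.append(t)
    return pivots

trajectory = ["turn_0", "turn_1", "turn_2", "turn_3", "turn_4"]
pivots = find_pivots(trajectory)
print("pivot turns:", pivots)
```

Turns where every sample succeeds or every sample fails carry little training signal, so only the mixed-outcome turns survive, which is what lets the method skip exhaustive full-trajectory rollouts.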