@dair_ai
New research on LLM Agent Generalization.

RL fine-tuning makes agents strong in familiar environments, but those gains struggle to transfer to unseen ones.

This paper systematically studies RL generalization for LLM agents across three axes: within-environment transfer across task difficulty, cross-environment transfer to unseen settings, and sequential multi-environment training.

Within an environment, RL delivers massive gains. Training on easy WebShop tasks improves hard-task performance by 60+ points, and an easy-to-hard curriculum adds another 2-3 points on top.

Across environments, transfer is weak. Agents average only 3.3-3.4 point improvements on unseen environments, and training on BabyAI actually drops WebShop from 28.6 to 10.3.

Sequential training is where it gets interesting: training across five environments one after another achieves performance comparable to joint training, with minimal forgetting.

The authors conclude that RL fine-tuning doesn't produce generally capable agents out of the box, but sequential training across diverse environments offers a practical path to broad competence.

Paper: https://t.co/BYfVK3DPoH

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
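For concreteness, here's a minimal Python sketch of the easy-to-hard curriculum idea: bucket tasks by difficulty and run RL stages from easy to hard. Everything here (Task, run_episode, policy_update) is an illustrative stub, not the paper's actual training code.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    difficulty: int  # 1 = easy, 3 = hard (assumed scale)

def run_episode(policy: dict, task: Task) -> float:
    """Stub rollout: stands in for the agent acting in the environment."""
    return random.random()

def policy_update(policy: dict, task: Task, reward: float) -> None:
    """Stub RL update: stands in for e.g. a policy-gradient step on the LLM."""
    policy["updates"] = policy.get("updates", 0) + 1

tasks = [Task(f"task-{i}", difficulty=random.randint(1, 3)) for i in range(300)]
policy: dict = {}

# Curriculum: finish all easy tasks before moving to harder stages.
for stage in (1, 2, 3):
    stage_tasks = [t for t in tasks if t.difficulty == stage]
    for task in stage_tasks:
        reward = run_episode(policy, task)
        policy_update(policy, task, reward)
    print(f"finished difficulty stage {stage} ({len(stage_tasks)} tasks)")
```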
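And a sketch of the sequential multi-environment setup: fine-tune on one environment at a time, evaluating on all of them after each stage to watch for forgetting. The post names only WebShop and BabyAI; the other three environment names below are placeholders, and train_on/evaluate are stubs, not the paper's code.

```python
import random

# WebShop and BabyAI come from the post; Env3-Env5 are placeholder names.
ENVS = ["WebShop", "BabyAI", "Env3", "Env4", "Env5"]

def train_on(policy: dict, env: str, episodes: int = 200) -> None:
    """Stub RL fine-tuning stage on a single environment."""
    policy[env] = policy.get(env, 0) + episodes

def evaluate(policy: dict, env: str) -> float:
    """Stub evaluation: stands in for success rate on held-out tasks."""
    return random.random()

policy: dict = {}
for env in ENVS:
    train_on(policy, env)
    # Score every environment after each stage: drops on earlier
    # environments would indicate catastrophic forgetting.
    scores = {e: round(evaluate(policy, e), 2) for e in ENVS}
    print(f"after {env}: {scores}")
```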