@omarsar0
This is a fascinating paper.

It's well known that long-horizon agents have a memory problem. The standard approach is to append everything to the context: every past observation, action, and thought gets added to the prompt. This creates three compounding issues: O(N) memory growth, degraded reasoning on out-of-distribution context lengths, and attention dilution that causes the model to forget key details even when they're technically in the prompt.

This new research unifies memory and reasoning into a single process. It introduces MEM1, an RL framework that trains agents to maintain constant memory across arbitrarily long multi-turn tasks. At each turn, the model updates a compact internal state that simultaneously consolidates prior information and reasons about the next action. After each turn, all previous observations, actions, and states are discarded; only the most recent internal state remains.

Inference-time reasoning serves two purposes: while reasoning about the current query, the model also extracts and stores exactly what it needs for future turns. Memory consolidation becomes part of the reasoning process itself, not a separate module.

Training uses PPO with a masked-trajectory technique. Because MEM1 dynamically prunes context, standard policy optimization breaks: tokens no longer belong to a single continuous trajectory. The authors solve this by stitching sub-trajectories together and applying 2D attention masks that restrict each token's attention to only what was visible when it was generated.

The results show dramatic efficiency gains. On 16-objective multi-hop QA, MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct. Peak token usage stays nearly constant as task complexity increases, while baseline methods scale linearly. On WebShop navigation, MEM1-7B outperforms AgentLM-13B (twice the parameters) with 2.8x less peak token usage.
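The masked-trajectory idea can be sketched roughly like this. This is a minimal illustration of the general technique, not the authors' code; the `segments` layout and function name are my assumptions. Each stitched segment records which earlier positions were still in context when its tokens were generated, and the mask enforces causal attention within that visible window only:

```python
import numpy as np

def build_visibility_mask(segments, total_len):
    """Build a 2D boolean attention mask for stitched sub-trajectories.

    `segments` is a list of (visible_start, start, end) tuples: tokens at
    positions [start, end) were generated while only positions from
    `visible_start` onward were in context (earlier turns were pruned).
    Returns mask[i, j] = True iff token i may attend to token j.
    """
    mask = np.zeros((total_len, total_len), dtype=bool)
    for visible_start, start, end in segments:
        for i in range(start, end):
            # Causal attention, but only over this segment's visible window:
            # positions before `visible_start` were discarded at generation time.
            mask[i, visible_start : i + 1] = True
    return mask

# Example: turn 1 occupies positions 0-3; turn 2 (positions 4-7) was generated
# with only positions 2+ in context (positions 0-1 were pruned after turn 1).
mask = build_visibility_mask([(0, 0, 4), (2, 4, 8)], total_len=8)
```

Here a turn-2 token at position 5 may attend to positions 2 through 5, but not to the pruned prefix (0, 1) or to future tokens, which is what lets a single PPO update treat the stitched sequence as if each token still saw exactly its original context.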
Notably, an agent trained on 2-objective tasks generalizes to 16-objective tasks. Performance actually improves relative to baselines as horizon length increases, because baseline models degrade on out-of-distribution context lengths while MEM1 maintains a constant-size context.

Emergent behaviors appear in the trained agents: maintaining structured memory for multiple concurrent questions, shifting focus when one objective stalls, and interleaving reasoning with selective memory updates.

External memory modules require separate training and engineering overhead. Full-context approaches don't scale. MEM1 shows that end-to-end RL can train models to consolidate memory as part of reasoning, achieving both efficiency and performance without architectural changes.

Paper: https://t.co/q9pEIxBpit

Learn to build effective AI Agents in my academy: https://t.co/JBU5beIoD0