@dair_ai
New research from NVIDIA. Long-running agentic tasks like deep research require multi-hop reasoning over many documents. One of the biggest challenges with agents is that context grows rapidly, and KV cache memory usage becomes the bottleneck.

As agents take on longer tasks, memory management can't rely on static heuristics. Existing cache compression techniques use fixed rules to decide what to keep, but in agentic reasoning, a token that seems unimportant early on may become critical ten turns later. Letting the model manage its own context is both more effective and more adaptive.

This new NVIDIA paper introduces SideQuest, a framework where the reasoning model manages its own KV cache. The model reasons about which tokens are still useful and clears the rest, essentially performing its own memory garbage collection. This management runs as an auxiliary task in parallel with the main reasoning thread, so the management tokens never pollute the primary context. That's important.

Trained with just 215 samples, SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal accuracy loss, outperforming all heuristic-based compression techniques.

Paper: https://t.co/n3P6UjtLJ7

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
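The core idea (model decides what to evict, an auxiliary pass does the bookkeeping) can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `KVCache`, `auxiliary_manager`, and the keep/evict decisions are all hypothetical stand-ins for the model's learned behavior and the real K/V tensors.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Each entry: (token_id, payload). Payload stands in for the K/V tensors.
    entries: list = field(default_factory=list)

    def append(self, token_id, payload):
        self.entries.append((token_id, payload))

    def compact(self, keep_decisions):
        # keep_decisions: one bool per entry, produced by an auxiliary
        # reasoning pass over the cache contents (the "side quest"),
        # so no management tokens enter the primary context.
        assert len(keep_decisions) == len(self.entries)
        self.entries = [e for e, keep in zip(self.entries, keep_decisions) if keep]

def auxiliary_manager(cache, still_useful):
    # Hypothetical stand-in for the model's side reasoning: keep tokens it
    # judges still useful; everything else is garbage-collected.
    return [tok in still_useful for tok, _ in cache.entries]

cache = KVCache()
for tok in ["doc1", "doc2", "scratch", "doc3", "filler"]:
    cache.append(tok, payload=None)

decisions = auxiliary_manager(cache, still_useful={"doc1", "doc3"})
cache.compact(decisions)
print([tok for tok, _ in cache.entries])  # ['doc1', 'doc3']
```

The point of the sketch: eviction is a decision made by the (mock) model itself, not a fixed heuristic like "drop the oldest tokens", so what counts as useful can change as the task unfolds.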