@dair_ai
New research on Self-Evolving AI Agents.

Really interesting benchmark for evaluating a critical but overlooked capability: can LLMs create reusable tools from scratch, not just use existing ones?

Tool-Genesis tests whether models can infer interfaces, generate schemas, and implement working code from natural language descriptions alone.

Why does it matter? Self-evolving agents need to build their own tools, not just pick from a menu. Current models produce plausible-looking interfaces that quietly break downstream, revealing a key bottleneck for autonomous tool creation.

The most promising finding: closed-loop repair, letting models debug via execution feedback, dramatically improves results. But the gains are scale-dependent; smaller models can't exploit that feedback as well.

Paper: https://t.co/ez1Sg1Igqn

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
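
For readers who want a concrete picture of closed-loop repair, here is a minimal Python sketch of the general idea: generate a tool, run it against checks, and feed any failure message back into the next attempt. This is not the Tool-Genesis harness. The names `run_candidate`, `closed_loop_repair`, and `stub_model` are hypothetical, and the stub stands in for a real LLM call.

```python
from typing import Callable, Optional

# A generator takes (description, feedback) and returns Python source code.
# In practice this would wrap an LLM call; here it is an assumption.
Generator = Callable[[str, Optional[str]], str]


def run_candidate(source: str, tests: list, func_name: str) -> Optional[str]:
    """Execute candidate source; return an error message, or None if all tests pass."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real harness would sandbox this.
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            result = func(*args)
            if result != expected:
                return f"{func_name}{args} returned {result!r}, expected {expected!r}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def closed_loop_repair(generate: Generator, description: str, tests: list,
                       func_name: str, max_rounds: int = 3):
    """Regenerate the tool until it passes, passing execution feedback each round."""
    feedback = None
    for round_index in range(1, max_rounds + 1):
        source = generate(description, feedback)
        feedback = run_candidate(source, tests, func_name)
        if feedback is None:
            return source, round_index  # working tool and rounds used
    return None, max_rounds  # no working tool within the budget


def stub_model(description: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model: buggy first draft, fixed after feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


source, rounds = closed_loop_repair(
    stub_model, "add two numbers", [((1, 2), 3)], "add"
)
```

In this toy run the first draft fails its check and the second attempt passes, so the loop stops after two rounds. A model that cannot act on the error message would exhaust `max_rounds` instead, which is one way to read the scale-dependence noted above.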