@dair_ai
Agent skills look great in demos. Hand an agent a curated toolbox, and it shines. But what happens when it has to find the right skill on its own from a large, unfiltered collection?

New research benchmarks LLM skill usage in realistic settings and finds that the performance gains from skills degrade steadily as conditions become more realistic, with pass rates approaching no-skill baselines. The proposed fix is query-specific skill refinement, which recovers much of the lost performance: on Terminal-Bench 2.0, it lifted Claude Opus 4.6's pass rate from 57.7% to 65.5%.

As skill and tool ecosystems grow, agents won't have curated toolboxes handed to them. They'll face noisy, overlapping, and irrelevant options.

Paper: https://t.co/Dm7JxredRI

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
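A minimal sketch of what query-specific skill refinement could look like in practice: retrieve a handful of candidate skills from the pool, then have the model prune and rewrite them for the specific query before the agent starts acting. The two-stage design, the `llm` callable, and the keyword-overlap retrieval are illustrative assumptions, not the paper's actual method.

```python
# Sketch of query-specific skill refinement (assumptions: a generic `llm`
# completion function and naive keyword-overlap retrieval; not the paper's
# implementation).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str
    description: str   # short summary of what the skill does
    instructions: str  # full skill body normally handed to the agent

def retrieve_candidates(query: str, pool: list[Skill], k: int = 5) -> list[Skill]:
    """Cheap first pass: rank skills by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(pool, key=lambda s: -len(q & set(s.description.lower().split())))
    return scored[:k]

def refine_skills(query: str, candidates: list[Skill],
                  llm: Callable[[str], str]) -> str:
    """Second pass: ask the model to discard skills irrelevant to THIS query
    and rewrite the rest into a compact, query-specific brief."""
    listing = "\n".join(f"- {s.name}: {s.description}" for s in candidates)
    prompt = (
        f"Task: {query}\n\n"
        f"Candidate skills:\n{listing}\n\n"
        "Discard skills irrelevant to this task. For each remaining skill, "
        "rewrite its instructions as short, task-specific guidance."
    )
    return llm(prompt)

# Usage (hypothetical): the refined brief, not the raw skill pool, goes into
# the agent's context before it acts.
# brief = refine_skills(task, retrieve_candidates(task, all_skills), llm=my_llm)
```

The point of the second stage is that relevance is judged against the concrete query rather than against a static catalog, which is where the paper finds unfiltered skill collections break down.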