@omarsar0
What's missing to build useful deep research agents?

Deep research agents promise analyst-level reports through automated search and synthesis. However, current systems fall short of genuinely useful research. The question is: where exactly do they fail?

This new paper introduces FINDER, a benchmark of 100 human-curated research tasks with 419 structured checklist items for evaluating report quality. Unlike QA benchmarks, FINDER focuses on comprehensive report generation.

The researchers analyzed approximately 1,000 reports from mainstream deep research agents. Their findings challenge assumptions about where these systems struggle. Current agents don't struggle with task comprehension. They fail at evidence integration, verification, and reasoning-resilient planning. They understand what you're asking. They just can't synthesize the answer reliably.

The paper also introduces DEFT, the first failure taxonomy for deep research agents. It identifies 14 distinct failure modes across three categories: reasoning failures, retrieval failures, and generation failures.

This systematic breakdown reveals that the gap between current capabilities and useful research isn't about smarter search or better language models. It's about the reasoning architecture that connects retrieval to synthesis.

(bookmark it)

Paper: https://t.co/gAA7feYHm1
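
To make the checklist evaluation concrete, here's a minimal sketch of how rubric-style grading like FINDER's could work. The names here (ChecklistItem, grade_report, the judge callable) are illustrative assumptions, not the paper's actual protocol:

```python
# Minimal sketch of checklist-style report grading, in the spirit of
# FINDER's 419 structured checklist items. All names and the scoring
# rule are hypothetical; the paper's actual rubric may differ.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChecklistItem:
    question: str        # e.g. "Does the report cite primary sources for claim X?"
    weight: float = 1.0  # items could be weighted by importance

def grade_report(
    report: str,
    items: list[ChecklistItem],
    judge: Callable[[str, str], bool],  # human annotator or LLM-as-judge
) -> float:
    """Score a report as the weighted fraction of checklist items satisfied."""
    total = sum(item.weight for item in items)
    earned = sum(item.weight for item in items if judge(report, item.question))
    return earned / total if total else 0.0
```

The point of a checklist over a single rubric score: each item is independently checkable, so failures localize to specific missing evidence or claims rather than a vague quality judgment.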
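
And a sketch of how DEFT-style failure annotations might be represented. Only the three top-level categories come from the post; the fine-grained mode names below are hypothetical placeholders, not the paper's actual 14:

```python
# Hypothetical representation of a DEFT-style failure annotation.
# Categories (reasoning / retrieval / generation) are from the post;
# the example mode strings are placeholders, not the paper's taxonomy.
from dataclasses import dataclass
from enum import Enum

class FailureCategory(Enum):
    REASONING = "reasoning"    # e.g. brittle planning, unverified inferences
    RETRIEVAL = "retrieval"    # e.g. missed or low-quality sources
    GENERATION = "generation"  # e.g. unsupported claims in the final report

@dataclass
class FailureAnnotation:
    category: FailureCategory
    mode: str      # one of the fine-grained failure modes (names hypothetical)
    evidence: str  # span of the report or agent trace exhibiting the failure

# Usage: tag failures in a batch of reports, then aggregate per category
# to see where an agent's pipeline breaks down.
ann = FailureAnnotation(FailureCategory.REASONING, "plan_drift", "step 3 ignored the revised query")
```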