@omarsar0
How it’s built

The authors propose HOMERAGENTS, a multi-agent framework that automates the generation of long-horizon workflow benchmarks.

HOMERAGENTS has two paths:

HOMERAGENTS+ iteratively turns atomic OfficeBench items into rich multi-day dialogues via a generator-verifier loop. This produces OdysseyBench+.

HOMERAGENTS-NEO explores an app environment, generates tasks (intent, subtasks, eval criteria), and then synthesizes 5-day dialogues.

All agents use GPT-4.1, and at least five calendar days of dialogue are produced per task.
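
A rough sketch of what a generator-verifier loop like the one in HOMERAGENTS+ could look like. This is not the authors' code: `refine`, `toy_generate`, `toy_verify`, and the `max_rounds` cap are hypothetical names made up for illustration, and the toy functions stand in for the GPT-4.1 agents. The only detail taken from the post is the five-day minimum.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MIN_DAYS = 5  # from the post: at least five calendar days of dialogue per task


@dataclass
class Dialogue:
    task: str
    days: List[str]  # one entry per calendar day of conversation


def refine(
    task: str,
    generate: Callable[[str, str], Dialogue],
    verify: Callable[[Dialogue], Tuple[bool, str]],
    max_rounds: int = 3,  # assumed cap; the post does not give a stopping rule
) -> Dialogue:
    """Regenerate a dialogue until the verifier accepts it or rounds run out."""
    feedback = ""
    for _ in range(max_rounds):
        dialogue = generate(task, feedback)
        ok, feedback = verify(dialogue)
        if ok:
            return dialogue
    raise RuntimeError("no dialogue passed verification")


# Toy stand-ins for the LLM agents (hypothetical, for illustration only).
def toy_generate(task: str, feedback: str) -> Dialogue:
    # The first draft is too short; once feedback arrives, produce enough days.
    n_days = MIN_DAYS if feedback else 3
    days = [f"Day {i + 1}: user discusses '{task}'" for i in range(n_days)]
    return Dialogue(task, days)


def toy_verify(dialogue: Dialogue) -> Tuple[bool, str]:
    # Reject any dialogue that spans fewer than MIN_DAYS days.
    if len(dialogue.days) < MIN_DAYS:
        return False, f"need at least {MIN_DAYS} days, got {len(dialogue.days)}"
    return True, ""


result = refine("send the Q3 report by email", toy_generate, toy_verify)
print(len(result.days))  # prints 5
```

In this toy run the first draft is rejected for covering only three days, and the second draft passes once the verifier's feedback is fed back to the generator.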