@PKirgis
Yesterday, we announced CRUX, a project that aims to conduct regular “open-world evaluations,” where we will be testing the ability of AI agents to complete long-horizon tasks in messy, real-world environments. @sayashk's post dives into the details; here are a few of my own thoughts about why this is worth doing.