@arankomatsuzaki
ClawBench: Can AI Agents Complete Everyday Online Tasks?

A real-world benchmark for AI agents: 153 everyday online tasks on live websites (shopping, booking, job applications). Even top models struggle, dropping from ~70% on sandbox benchmarks to as low as 6.5% here.

https://t.co/ANUnjY8rlV