@qineng_wang
Most VLM benchmarks watch the world; few ask how actions *change* it through a robot's eyes. Embodied cognition tells us intelligence isn't just watching – it's enacted through interaction. 👉We introduce ENACT: a benchmark that tests whether VLMs can track how a home-scale environment evolves from a robot's egocentric view. 🌐https://t.co/2yBqVcGiKk 📄https://t.co/iqe3HgSMnD 1/N