@baifeng_shi
Humans can see in high-res at high FPS in real time. Why can't VLMs? Introducing AutoGaze: ViTs/VLMs "gaze" only at key video regions! 4-100x token savings, up to 19x speedup, and scaling to 4K-res, 1K-frame videos. https://t.co/GhbWZwMAg7 https://t.co/mEJ991MAIR 🤗 https://t.co/FOfc2QRThi (1/n) 🧵