@xxtiange
‼️VLMs/MLLMs do NOT yet understand the physical world from videos‼️

In our recent work, we found that even the most advanced AI models still lag behind humans in one key aspect: reasoning about the kinematic properties of objects from videos.

Takeaways:
1. ChatGPT 5.1 leads overall among 21 advanced VLMs, followed by Gemini 2.5 Pro/Flash.
2. Grok 4.1 delivers impressive performance at the lowest API cost.
3. Qwen3-VL is the top-performing open-source model.

Read here: https://t.co/5lagvLNE37

🧵1/N