@rohanpaul_ai
New @AIatMeta paper builds a vision-language world model that turns videos into text plans and reasons over them to pick better actions. System-2 reflective planning scores 27% higher Elo than System-1.

The gap it tackles: agents must predict how actions change the world, not just label frames.

VLWM, the Vision Language World Model, represents the hidden world state in plain language, predicting a goal and then interleaved actions with their resulting state changes.

Training targets come from a Tree of Captions that compresses each video; an LLM then refines those captions into goals and state updates.

The model jointly learns a policy that proposes the next action and a dynamics model that predicts the next state.

In fast mode it completes the plan text left to right, which is quick but can lock in early mistakes. In reflective mode it searches over candidate plans, rolls out their futures, and picks the lowest-cost path.

The critic that supplies this cost is trained without labels, by ranking valid progress as lower cost than distractor or shuffled steps.

Across planning benchmarks and human head-to-head comparisons, reflective search produces cleaner, more reliable plans.

----

Paper: arxiv.org/abs/2509.02722

Paper Title: "Planning with Reasoning using Vision Language World Model"
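A minimal sketch of the reflective-mode search described above, assuming toy stand-ins for the policy, dynamics model, and critic (propose_actions, predict_next_state, critic_cost are hypothetical names, not the paper's code):

```python
# Toy sketch of VLWM-style System-2 (reflective) planning:
# propose candidate actions, roll out predicted states in text,
# and keep the lowest-cost plan according to a critic.
from dataclasses import dataclass, field

@dataclass
class Plan:
    steps: list = field(default_factory=list)   # alternating action / state text
    cost: float = 0.0                           # critic cost for this partial plan

def propose_actions(goal: str, state: str, k: int = 3) -> list[str]:
    """Stand-in for the policy head: propose k candidate next actions as text."""
    return [f"action_{i} toward '{goal}' given '{state}'" for i in range(k)]

def predict_next_state(state: str, action: str) -> str:
    """Stand-in for the dynamics head: predict the next world state as text."""
    return f"state after ({action})"

def critic_cost(goal: str, steps: list) -> float:
    """Stand-in for the label-free critic: lower cost = more progress toward the goal."""
    return max(0.0, 10.0 - len(steps))          # toy heuristic only

def reflective_plan(goal: str, init_state: str, horizon: int = 4, beam: int = 3) -> Plan:
    """Roll out candidate futures and keep the lowest-cost plans (simple beam search)."""
    frontier = [Plan(steps=[init_state])]
    for _ in range(horizon):
        candidates = []
        for plan in frontier:
            state = plan.steps[-1]
            for action in propose_actions(goal, state):
                next_state = predict_next_state(state, action)
                steps = plan.steps + [action, next_state]
                candidates.append(Plan(steps, critic_cost(goal, steps)))
        frontier = sorted(candidates, key=lambda p: p.cost)[:beam]  # lowest-cost paths survive
    return frontier[0]

if __name__ == "__main__":
    best = reflective_plan(goal="make an omelette", init_state="kitchen, raw eggs")
    print(f"cost={best.cost:.1f}")
    for step in best.steps:
        print(" ", step)
```

Fast mode corresponds to taking the single greedy continuation instead of searching; the search over rollouts is what lets reflective mode avoid locking in early mistakes.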
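And a sketch of the label-free critic objective: a margin-ranking setup that pushes valid plan progress to lower cost than distractor or shuffled steps. The scalar critic, feature sizes, and data here are illustrative assumptions, not the paper's implementation:

```python
# Toy margin-ranking training for a cost critic: valid progress should
# receive lower cost than distractor / shuffled steps, no labels needed.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # features -> scalar cost
rank_loss = nn.MarginRankingLoss(margin=1.0)
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Fake embeddings standing in for encoded (goal, partial plan) pairs.
valid = torch.randn(8, 16)       # genuine next steps that advance the goal
distractor = torch.randn(8, 16)  # irrelevant or shuffled steps

for _ in range(100):
    cost_valid = critic(valid).squeeze(-1)
    cost_bad = critic(distractor).squeeze(-1)
    # target = 1 asks the first argument (distractor cost) to exceed the second (valid cost).
    loss = rank_loss(cost_bad, cost_valid, torch.ones(8))
    opt.zero_grad(); loss.backward(); opt.step()
```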