@hanlin_hl
Multimodal LLMs (MLLMs) excel at reasoning, layout understanding, and planning, yet in diffusion-based generation they are often reduced to simple multimodal encoders.

What if MLLMs could reason directly in latent space and guide diffusion generation with fine-grained spatiotemporal control? 🤔

Introducing MetaCanvas 🎨 A lightweight framework that translates MLLM reasoning into structured spatiotemporal conditions for diffusion models. 🧵 👇