@HuggingPapers
Imagination Helps Visual Reasoning, But Not Yet in Latent Space

Causal mediation analysis reveals that latent visual reasoning in MLLMs fails: latent tokens ignore inputs and barely affect answers. CapImagine, a text-based alternative, teaches explicit imagination and significantly outperforms latent baselines.