@JunweiLiangCMU
Meet DiT4DiT, the FIRST video generation architecture for humanoid robot control. 🤖✨ By treating video generation as a world model, we give robots real "physical intuition."

🔥 The results:
🚀 >10x better sample efficiency and up to 7x faster convergence
🏆 SOTA on LIBERO (98.6%) and RoboCasa-GR1 (50.8%)
🦾 Zero-shot generalization on the Unitree G1 humanoid using only monocular vision (1x speed, fully autonomous)

🧠 How it works: we couple a Video DiT with an Action DiT through a dual flow-matching objective. Instead of relying on fully reconstructed future frames, we extract "intermediate denoising features" from the video branch to guide action prediction. Simple but highly effective!

Check out the paper, real-world videos, and project page here: https://t.co/Ml0AA8PKqA

#EmbodiedAI #Robotics #MachineLearning #WorldModels
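
For anyone who wants the gist in code, here is a minimal sketch (not the authors' implementation) of the dual flow-matching coupling described above: a video DiT predicts the flow velocity for noised future-frame tokens while exposing its intermediate denoising features, and an action DiT cross-attends to those features to predict the action velocity. All module names, tensor shapes, and the choice of which feature map to use are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a video DiT and an action DiT trained
# with a dual flow-matching loss, where intermediate denoising features from the
# video branch condition action prediction instead of reconstructed frames.
import torch
import torch.nn as nn

class TinyDiTBlock(nn.Module):
    """Stand-in transformer block; a real DiT adds adaLN timestep conditioning."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class VideoDiT(nn.Module):
    """Predicts the flow-matching velocity for noised future-frame tokens."""
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(TinyDiTBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, dim)

    def forward(self, x_t):
        feats = []
        for blk in self.blocks:
            x_t = blk(x_t)
            feats.append(x_t)          # intermediate denoising features
        return self.head(x_t), feats   # velocity prediction + features

class ActionDiT(nn.Module):
    """Predicts the flow-matching velocity for a noised action chunk,
    cross-attending to video features rather than reconstructed frames."""
    def __init__(self, dim=256, act_dim=32, depth=4):
        super().__init__()
        self.proj_in = nn.Linear(act_dim, dim)
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.blocks = nn.ModuleList(TinyDiTBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, act_dim)

    def forward(self, a_t, video_feats):
        h = self.proj_in(a_t)
        ctx = video_feats[-1]          # assumption: condition on the last feature map
        h = h + self.cross(h, ctx, ctx, need_weights=False)[0]
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)

def dual_flow_matching_loss(video_dit, action_dit, video_tokens, actions):
    """Both branches regress the linear-path velocity v = x1 - x0 at a shared time t."""
    t = torch.rand(video_tokens.size(0), 1, 1)
    v_noise, a_noise = torch.randn_like(video_tokens), torch.randn_like(actions)
    x_t = (1 - t) * v_noise + t * video_tokens   # noised future-frame tokens
    a_t = (1 - t) * a_noise + t * actions        # noised action chunk
    v_pred, feats = video_dit(x_t)
    a_pred = action_dit(a_t, feats)
    loss_video = ((v_pred - (video_tokens - v_noise)) ** 2).mean()
    loss_action = ((a_pred - (actions - a_noise)) ** 2).mean()
    return loss_video + loss_action

video_dit, action_dit = VideoDiT(), ActionDiT()
loss = dual_flow_matching_loss(video_dit, action_dit,
                               torch.randn(2, 64, 256),   # (batch, video tokens, dim)
                               torch.randn(2, 16, 32))    # (batch, action horizon, act_dim)
loss.backward()
```

At inference, one would integrate the action branch's velocity field from noise to an action chunk while the video branch only needs to be run far enough to produce its intermediate features, which is where the sample-efficiency and speed gains plausibly come from.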