@LiorOnAI
Robotics just proved it can scale like language models.

SONIC trained a 42 million parameter model on 100 million frames of human motion and achieved 100% success transferring to real robots with zero fine-tuning.

The breakthrough isn't the robot doing backflips. It's that someone finally found the "next token prediction" equivalent for physical movement.

For years, training robots meant hand-crafting reward functions for every single skill. Want your robot to walk? Design rewards for balance, foot placement, energy efficiency. Want it to dance? Start over with entirely new rewards.

This approach hits a wall because humans can't manually specify every nuance of natural movement.

SONIC replaces this with motion tracking: the robot learns by watching 700 hours of motion capture data and trying to mimic it, frame by frame. The data itself becomes the reward function.

Scale the data, scale the model, scale the compute, and performance improves predictably. Just like GPT.

This unlocks something robotics has never had: a universal control interface. One policy handles:

1. VR teleoperation using head and hand tracking
2. Live webcam feeds converted to robot motion in real-time
3. Text commands like "walk sideways" or "dance like a monkey"
4. Music audio where the robot matches tempo and rhythm
5. Vision-language models for autonomous tasks (95% success rate)

All inputs get encoded into the same token space, then decoded into motor commands. No retraining. No reward engineering. No manual retargeting between human and robot skeletons.

If this holds, robotics just closed a 5-year gap with AI.

Language models scaled by finding one task (predict the next word) that generalizes to everything. Vision models did the same with image classification. Robotics now has motion tracking.

Expect the next wave of humanoid companies to train on billions of frames, not millions.
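
For the curious, here is a minimal sketch of what "the data itself becomes the reward function" can look like. This is not SONIC's actual code or reward. It assumes poses are plain lists of joint angles and uses a generic exponential tracking kernel; the names `tracking_reward`, `episode_return`, and `sharpness` are made up for illustration, and the numbers are toy values.

```python
import math

def tracking_reward(robot_pose, reference_pose, sharpness=2.0):
    """Reward in (0, 1]: 1.0 when the robot pose matches the mocap frame exactly.

    Assumption: a pose is a list of joint angles in radians. `sharpness` is a
    hypothetical tuning knob, not a value taken from the SONIC work.
    """
    if len(robot_pose) != len(reference_pose):
        raise ValueError("pose vectors must have the same length")
    # Mean squared error between the robot's joints and the reference frame.
    mse = sum((r - q) ** 2 for r, q in zip(robot_pose, reference_pose)) / len(reference_pose)
    # Exponential kernel: reward falls off smoothly as tracking error grows.
    return math.exp(-sharpness * mse)

def episode_return(robot_trajectory, reference_clip):
    """Sum the per-frame tracking rewards over a clip, frame by frame."""
    return sum(tracking_reward(r, q) for r, q in zip(robot_trajectory, reference_clip))

# Toy 3-joint clip with two frames (made-up numbers).
reference_clip = [[0.0, 0.5, -0.5], [0.1, 0.6, -0.4]]
perfect = [frame[:] for frame in reference_clip]   # exact imitation
sloppy = [[0.3, 0.2, -0.1], [0.4, 0.1, 0.0]]       # poor imitation

print(episode_return(perfect, reference_clip))
print(episode_return(sloppy, reference_clip))
```

The point of the sketch: nothing in it is specific to walking or dancing. Swap in a different motion clip and the same function scores a different skill, which is why adding data takes the place of reward engineering.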