@gerardsans
@recatm Xie Saining is spot on. Internet text comes pre-baked with human POV and labels. Raw images/video donβt. One scene can be shot from endless angles, camera height, distance, tilt, 1st-person vs 3rd-person, each completely changing the signal. This is why vision needs far stronger inductive biases than language. Text has it embedded. Vision is missing data to fill this gap.