@lukaskuhn77
🔥 We introduce LeVLJEPA: the first fully non-contrastive, end-to-end vision-language pretraining method competitive with CLIP & SigLIP 💪🏼 No negatives. No temperature. No momentum encoder. No teacher-student. TL;DR: LeVLJEPA learns image-text structure by prediction: each modality predicts the other's embedding, while SIGReg keeps each modality's embeddings isotropic Gaussian. 🧵 https://t.co/1qBXor8qTf
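
To make the TL;DR concrete, here is a rough, dependency-free sketch of the two ingredients it names: a cross-modal prediction term and a regularizer that pushes embeddings toward an isotropic Gaussian. This is not the LeVLJEPA code. The linear predictors `P_img2txt` and `P_txt2img`, the weight `lam`, and every function name are hypothetical, and `gaussian_moment_penalty` is a deliberately simplified stand-in for SIGReg that only matches the mean and variance of random 1D projections.

```python
import math
import random


def linear(W, x):
    """Apply a matrix W (list of rows) to a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def cross_prediction_loss(img_emb, txt_emb, P_img2txt, P_txt2img):
    """Each modality predicts the other's embedding; average both errors.

    Assumption: predictors are plain linear maps. No negatives and no
    temperature appear anywhere in this term.
    """
    total = 0.0
    for z_img, z_txt in zip(img_emb, txt_emb):
        total += mse(linear(P_img2txt, z_img), z_txt)  # image -> text
        total += mse(linear(P_txt2img, z_txt), z_img)  # text -> image
    return total / (2 * len(img_emb))


def gaussian_moment_penalty(emb, num_directions=16, seed=0):
    """Simplified stand-in for SIGReg (NOT the actual regularizer).

    Projects the batch onto random unit directions and penalizes each 1D
    projection for deviating from zero mean and unit variance.
    """
    rng = random.Random(seed)
    dim = len(emb[0])
    penalty = 0.0
    for _ in range(num_directions):
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        proj = [sum(di * xi for di, xi in zip(d, x)) for x in emb]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / num_directions


def total_loss(img_emb, txt_emb, P_img2txt, P_txt2img, lam=1.0):
    """Prediction term plus the regularizer on each modality (lam is hypothetical)."""
    pred = cross_prediction_loss(img_emb, txt_emb, P_img2txt, P_txt2img)
    reg = gaussian_moment_penalty(img_emb) + gaussian_moment_penalty(txt_emb)
    return pred + lam * reg


if __name__ == "__main__":
    rng = random.Random(1)
    dim, batch = 4, 32
    identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    img = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch)]
    txt = [[v + 0.1 * rng.gauss(0.0, 1.0) for v in row] for row in img]
    print("loss:", total_loss(img, txt, identity, identity))
```

The regularizer is what rules out the trivial solution: if every embedding collapses to the same point, the prediction term is zero but each projection has zero variance, so the penalty stays large.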