@askalphaxiv
A new mind-blowing finding: causal prediction alone is sufficient for strong visual learning, with no need for any fancy reconstruction, masking, or contrastive loss. In this paper, NEPA, instead of reconstructing pixels, the authors train a vision model to autoregressively predict the next patch embedding. That alone yields strong visual understanding, matching or beating DINO and JEPA with a far simpler setup. Now trending on alphaXiv 📈
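For intuition, here is a toy sketch of what a next-patch-embedding objective could look like. It is not the paper's code: the cosine-distance loss, the `next_embedding_loss` helper, and the `copy_last` baseline predictor are all assumptions made for illustration, and a real setup would use a learned causal transformer rather than a hand-written function.

```python
import math


def cosine_similarity(u, v):
    # Assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def next_embedding_loss(embeddings, predictor):
    """Mean (1 - cosine similarity) between predicted and actual next embeddings.

    The predictor only ever sees embeddings[0..t] (the causal constraint)
    and must guess embeddings[t + 1]. The choice of loss is an assumption.
    """
    losses = []
    for t in range(len(embeddings) - 1):
        context = embeddings[: t + 1]   # patches seen so far
        predicted = predictor(context)  # guess for the next patch embedding
        target = embeddings[t + 1]      # the actual next patch embedding
        losses.append(1.0 - cosine_similarity(predicted, target))
    return sum(losses) / len(losses)


def copy_last(context):
    # Hypothetical baseline predictor: repeat the most recent embedding.
    return context[-1]


# Three made-up 2-D "patch embeddings" in raster order.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss = next_embedding_loss(patches, copy_last)
print(loss)
```

The point of the sketch is that the training signal comes only from the shifted-by-one targets: there is no pixel decoder, no mask token, and no negative pairs.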