@askalphaxiv
Yann LeCun 🤝 Saining Xie: insane crossover of two of the biggest visual representation researchers in AI. “Beyond Language Modeling: An Exploration of Multimodal Pretraining”

Right now, most multimodal models are basically a language model with a vision adapter bolted on: they can describe images, but they don’t really think in images or video. This paper shows what happens when you do it the hard way: train one model from scratch on text, images, and video with a unified setup.

The key idea: give the model a good visual internal representation, and it can use vision for both understanding and generation. Multimodal data can also improve language instead of distracting from it, and mixture-of-experts lets you scale vision’s huge data intake without bloating everything else.

This paves the way for shifting the vision paradigm from “captioning add-on” to native multimodal foundation model.
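To make the mixture-of-experts point concrete, here is a toy sketch of top-k expert routing: each token is sent to only a few experts, so you can add experts (capacity for the visual data firehose) without growing the per-token compute. This is a generic illustration under assumed shapes and names, not the paper’s actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2  # hypothetical sizes, for illustration only

# Each expert is a small linear map; only the routed ones run for a given token.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_forward(x):
    """Route each token to its top_k experts.

    Per-token compute scales with top_k, not n_experts, which is why
    adding experts grows capacity without bloating inference cost.
    """
    logits = x @ gate_w                          # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top_k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # Softmax over only the selected experts' gate logits.
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((5, d))  # e.g. a mix of text and image tokens
y = moe_forward(tokens)
print(y.shape)  # (5, 8)
```

The same routing mechanism applies whether the tokens come from text or vision, which is what lets one unified model absorb both streams.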