@gabriberton
The term VLM has two related but very different meanings and it's so confusing 1) CLIP-like VLMs: 2 encoders trained from scratch 2) Llava-like VLMs: a vision encoder attached to an LLM, both pretrained Ugly image generated with nano banana of course https://t.co/JrhVsJxySq