@iScienceLuvr
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment "iBOT++ extends the patch-level self-supervised loss to all tokens for stronger dense alignment; Head-only EMA reduces training cost while retaining performance; and Multi-Granularity Captions uses PaliGemma and Gemini descriptions for richer text supervision."