@Yihe__Deng
Large Vision Language Models are prone to object hallucinations — how can we address this issue cost-efficiently? Introducing MARINE: a training-free, API-free framework for tackling object hallucinations. Joint work with an amazing team @linxizhao4 @WeitongZhang and @QuanquanGu! arXiv: https://t.co/Lg3NUIaNaw

By incorporating a pre-trained object grounding vision encoder, MARINE enriches the visual context of LVLMs and controls text generation via classifier-free guidance (CFG) specifically designed for the multi-modal setting. MARINE corrects hallucinations without extra fine-tuning or access to advanced LLMs.

MARINE is compatible with any vision model; in our study, we showcase its effectiveness using the DEtection TRansformer (DETR) as the object grounding vision encoder.

Tested on six widely recognized LVLMs with MSCOCO, MARINE outperforms current methods in reducing hallucinations, as verified by the commonly used CHAIR and POPE metrics.

Our ablation studies shed light on how varying guidance strengths affect MARINE's performance and generations, with concrete examples showing how the guidance tweaks the LVLMs' output logits.

Check the details! [1/N]
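For readers curious about the mechanism: a minimal sketch of how classifier-free guidance can steer next-token logits toward an object-grounded prediction. This is an illustrative toy in NumPy, not the paper's implementation; the function name, the toy logits, and the specific interpolation form `(1 + gamma) * cond - gamma * uncond` are my assumptions about a standard CFG formulation.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, gamma):
    # Classifier-free guidance (one common formulation):
    # push the distribution toward the grounded (conditional) logits
    # and away from the ungrounded ones. gamma=0 recovers the
    # conditional logits unchanged.
    return (1 + gamma) * cond_logits - gamma * uncond_logits

# Toy vocabulary of 5 tokens (hypothetical values for illustration)
cond = np.array([2.0, 0.5, 0.1, -1.0, 0.3])    # with object-grounding features
uncond = np.array([1.0, 1.5, 0.1, -0.5, 0.3])  # plain visual context only

guided = cfg_logits(cond, uncond, gamma=0.5)

# Convert to a sampling distribution with a numerically stable softmax
probs = np.exp(guided - guided.max())
probs /= probs.sum()
```

Raising `gamma` amplifies the gap between the grounded and ungrounded predictions, which is the knob the ablation studies vary.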