@iScienceLuvr
NVILA: Efficient Frontier Visual Language Models

abs: https://t.co/4lk7WHWwYr

NVIDIA introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. The model arch first scales up spatial and temporal resolution, then compresses the visual tokens, allowing efficient processing of high-resolution inputs. Also uses "DeltaLoss" data pruning and FP8 training. Competitive with proprietary VLMs on visual understanding benchmarks.
arXiv
This paper presents NVILA, a family of open visual language models that optimize efficiency and accuracy, significantly reducing training and latency ...
• NVILA improves efficiency and accuracy of VLMs.
• Reduces training costs by 4.5X and decoding latency by up to 2.8X.
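
To make the "scale up, then compress visual tokens" idea concrete, here is a minimal sketch of one common way to compress a grid of visual tokens: merging each 2x2 neighborhood into a single token by concatenating channels. This is an illustration under stated assumptions, not NVILA's actual implementation; the function name `compress_visual_tokens`, the `window` parameter, and the toy shapes are all hypothetical.

```python
import numpy as np

def compress_visual_tokens(tokens, grid_h, grid_w, window=2):
    """Merge each window x window patch of tokens into one token.

    tokens: array of shape (grid_h * grid_w, channels), in row-major grid order.
    Returns an array of shape (grid_h * grid_w / window**2, channels * window**2).
    Hypothetical helper for illustration only.
    """
    n, c = tokens.shape
    if n != grid_h * grid_w:
        raise ValueError("token count does not match the grid size")
    if grid_h % window or grid_w % window:
        raise ValueError("grid dimensions must be divisible by window")

    # Split the grid into (row block, row within block, col block, col within block).
    x = tokens.reshape(grid_h // window, window, grid_w // window, window, c)
    # Bring the two block axes to the front so each block's tokens are contiguous.
    x = x.transpose(0, 2, 1, 3, 4)
    # Flatten each block into one wider token.
    return x.reshape((grid_h // window) * (grid_w // window), window * window * c)

# Toy example: a 4x4 grid of 3-channel tokens becomes 4 tokens of 12 channels.
tokens = np.arange(4 * 4 * 3, dtype=np.float32).reshape(16, 3)
compressed = compress_visual_tokens(tokens, grid_h=4, grid_w=4)
print(compressed.shape)  # (4, 12)
```

With a 2x2 window the language model sees 4x fewer visual tokens, which is the kind of saving that makes higher input resolutions affordable; no information is discarded at this step, it is only repacked into wider tokens.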