@reach_vb
Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 🔥

> Understands and processes images, speech, and text
> Generates real-time speech responses
> Supports interruptions during speech

Technical Overview:

> Concatenates image, audio, and text features for input
> Uses text-guided delayed parallel output for real-time speech
> Involves encoder adaptation, modal alignment, and multimodal fine-tuning

Best part: MIT licensed ⚡
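
A rough sketch of the first two technical-overview points, not Mini-Omni 2's actual code. It assumes every encoder already projects to one shared embedding width, and it assumes "delayed parallel output" means the text stream leads while each audio layer starts one step later than the one before. `concat_features`, `delay_streams`, and `PAD` are hypothetical names made up for illustration.

```python
from typing import List

Vector = List[float]

PAD = -1  # hypothetical padding token id, not taken from the model


def concat_features(
    image: List[Vector], audio: List[Vector], text: List[Vector]
) -> List[Vector]:
    """Join per-modality feature sequences along the sequence axis."""
    sequence = image + audio + text
    # Assumption: all modalities share one embedding width after projection.
    widths = {len(vec) for vec in sequence}
    if len(widths) > 1:
        raise ValueError("all features must share one embedding width")
    return sequence


def delay_streams(
    text_tokens: List[int], audio_layers: List[List[int]]
) -> List[List[int]]:
    """Put the text stream first and shift audio layer k right by k + 1 steps."""
    streams = [text_tokens] + audio_layers
    # Total length is set by the stream that ends last once shifted.
    total = max(len(stream) + shift for shift, stream in enumerate(streams))
    return [
        [PAD] * shift + stream + [PAD] * (total - shift - len(stream))
        for shift, stream in enumerate(streams)
    ]


if __name__ == "__main__":
    fused = concat_features([[0.1, 0.2]], [[0.3, 0.4], [0.5, 0.6]], [[0.7, 0.8]])
    print(len(fused))
    for row in delay_streams([1, 2, 3], [[10, 11, 12], [20, 21, 22]]):
        print(row)
```

Because the text row starts first, each audio step can condition on text that has already been produced, which is one plausible reading of "text-guided".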