🐦 Twitter Post Details

@reach_vb

Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 🔥

> Understands and processes images, speech, and text
> Generates real-time speech responses
> Supports interruptions during speech

Technical Overview:
> Concatenates image, audio, and text features for input.
> Uses text-guided delayed parallel output for real-time speech
> Involves encoder adaptation, modal alignment, and multimodal fine-tuning

Best part: MIT licensed ⚡
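The first point of the technical overview, concatenating image, audio, and text features into one input, can be sketched roughly as follows. This is only an illustration of the general idea and not Mini-Omni 2's actual code: the function name `concat_modalities`, the 4-dimensional feature size, and the toy values are all assumptions made for the example.

```python
from typing import List

Feature = List[float]  # one feature vector (hypothetical 4-dim embedding)


def concat_modalities(
    image: List[Feature], audio: List[Feature], text: List[Feature]
) -> List[Feature]:
    """Join per-modality feature sequences along the sequence axis."""
    sequences = [image, audio, text]
    # All vectors must share one feature dimension before they can be joined.
    dims = {len(vec) for seq in sequences for vec in seq}
    if len(dims) > 1:
        raise ValueError(f"feature dimensions differ: {sorted(dims)}")
    return [vec for seq in sequences for vec in seq]


# Toy inputs (assumed): 2 image, 3 audio, and 1 text feature vectors.
image_feats = [[0.1] * 4 for _ in range(2)]
audio_feats = [[0.2] * 4 for _ in range(3)]
text_feats = [[0.3] * 4 for _ in range(1)]

combined = concat_modalities(image_feats, audio_feats, text_feats)
print(len(combined), len(combined[0]))  # sequence length, feature dimension
```

In a real model the per-modality features would come from separate encoders and be projected to a shared dimension first; the sketch assumes that has already happened.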

📊 Media Metadata

{
  "score": 0.76,
  "scored_at": "2025-08-09T13:46:07.550602",
  "import_source": "network_archive_import",
  "media": [
    {
      "type": "video",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1850895844167286859/media_0.mp4?",
      "filename": "media_0.mp4"
    },
    {
      "id": "",
      "type": "video",
      "url": null,
      "media_url": "https://pbs.twimg.com/ext_tw_video_thumb/1850892710057476096/pu/img/C7-wEGl5pp06mW0v.jpg",
      "media_url_https": null,
      "display_url": null,
      "expanded_url": null
    }
  ],
  "reprocessed_at": "2025-08-12T15:25:35.592360",
  "reprocessed_reason": "missing_media_array",
  "original_structure": "had_both"
}

🔧 Raw API Response

{
  "user": {
    "created_at": "2017-06-14T13:50:54.000Z",
    "default_profile_image": false,
    "description": "GPU poor @Huggingface | F1 fan | Here for @at_sofdog’s wisdom | *opinions my own",
    "fast_followers_count": 0,
    "favourites_count": 9835,
    "followers_count": 20342,
    "friends_count": 257,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 444,
    "location": "nvidia-smi",
    "media_count": 1158,
    "name": "Vaibhav (VB) Srivastav",
    "normal_followers_count": 20342,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/874987512850128897/1651638851",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1509901130670747666/JFlrSzB4_normal.jpg",
    "screen_name": "reach_vb",
    "statuses_count": 6133,
    "translator_type": "none",
    "url": "https://t.co/83qEAS8N0w",
    "verified": true,
    "withheld_in_countries": [],
    "id_str": "874987512850128897"
  },
  "id": "1850895844167286859",
  "conversation_id": "1850895844167286859",
  "full_text": "Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 🔥\n\n> Understands and processes images, speech, and text\n> Generates real-time speech responses\n> Supports interruptions during speech\n\nTechnical Overview:\n> Concatenates image, audio, and text features for input. \n> Uses text-guided delayed parallel output for real-time speech\n> Involves encoder adaptation, modal alignment, and multimodal fine-tuning\n\nBest part: MIT licensed ⚡",
  "reply_count": 8,
  "retweet_count": 70,
  "favorite_count": 347,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/ext_tw_video_thumb/1850892710057476096/pu/img/C7-wEGl5pp06mW0v.jpg",
      "type": "video",
      "video_url": "https://video.twimg.com/ext_tw_video/1850892710057476096/pu/vid/avc1/1482x720/lzR7X9bfJQUsTaF3.mp4?tag=12"
    }
  ],
  "url": "https://twitter.com/reach_vb/status/1850895844167286859",
  "created_at": "2024-10-28T13:42:11.000Z",
  "#sort_index": "1850895844167286859",
  "view_count": 44495,
  "quote_count": 3,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://x.com/reach_vb/status/1850895844167286859"
}