🐦 Twitter Post Details

@_akhaliq

Vision-Flan

Scaling Human-Labeled Tasks in Visual Instruction Tuning

Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within the existing VLM frameworks: (1) lacking task diversity in pretraining and visual instruction tuning, and (2) annotation error and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, and each task is accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are firstly finetuned on Vision-Flan and further tuned on GPT-4 synthesized data. We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves the state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses to human-preferred formats; (2) A minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human-preference; (3) Visual instruction tuning mainly helps large-language models (LLMs) to understand visual features.
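A minimal PyTorch sketch of the two-stage recipe the abstract describes. The toy model, placeholder datasets, objective, and hyperparameters (ToyVLM, RandomInstructionData, learning rate, batch size) are illustrative assumptions, not the authors' code; the only point carried over from the paper is the ordering: a long stage-1 pass over the large human-labeled Vision-Flan mixture, then a short stage-2 pass over a small GPT-4-synthesized set.

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class ToyVLM(nn.Module):
    """Stand-in for a vision-language model (projector + language head)."""
    def __init__(self, dim=32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)     # vision-to-LLM projector
        self.lm_head = nn.Linear(dim, dim)  # stand-in for the LLM decoder

    def forward(self, image_feats):
        return self.lm_head(self.proj(image_feats))


class RandomInstructionData(Dataset):
    """Placeholder for an instruction-tuning dataset of (input, target) pairs."""
    def __init__(self, n, dim=32):
        self.x = torch.randn(n, dim)
        self.y = torch.randn(n, dim)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]


def finetune(model, dataset, epochs=1, lr=1e-4):
    """One generic supervised fine-tuning pass (MSE stands in for the usual
    language-modeling loss in this sketch)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model


model = ToyVLM()
# Stage 1: fine-tune on the large, diverse human-labeled Vision-Flan tasks
# (187 tasks / ~1.66M instances in the real dataset; 1,000 toy rows here).
model = finetune(model, RandomInstructionData(1000))
# Stage 2: briefly tune on a small GPT-4-synthesized set; per finding (2)
# in the abstract, ~1,000 examples suffice to align the response format.
model = finetune(model, RandomInstructionData(100))

Per the abstract's findings, stage 2 mostly reshapes responses toward human-preferred formats rather than adding capability, which is why such a small synthesized set is enough.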

Media 1

📊 Media Metadata

{
  "media": [
    {
      "type": "photo",
      "url": "https://pbs.twimg.com/media/GGwPF5gWoAAjfUu.png",
      "media_url": "https://pbs.twimg.com/media/GGwPF5gWoAAjfUu.png",
      "filename": "media_0.jpg"
    }
  ],
  "nlp": {
    "sentiment": "neutral",
    "processed_at": "2025-08-06T12:53:04.276090"
  },
  "original_structure": "had_media_only",
  "enhanced_from_raw_response": true,
  "enhanced_at": "2025-08-13T17:20:00Z"
}

🔧 Raw API Response

{
  "user": {
    "created_at": "2014-04-27T00:20:12.000Z",
    "default_profile_image": false,
    "description": "AI research paper tweets, ML @Gradio (acq. by @HuggingFace 🤗)\n\ndm for promo",
    "fast_followers_count": 0,
    "favourites_count": 29719,
    "followers_count": 295546,
    "friends_count": 2398,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 3707,
    "location": "subscribe → ",
    "media_count": 15522,
    "name": "AK",
    "normal_followers_count": 295546,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/2465283662/1610997549",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1451191636810092553/kpM5Fe12_normal.jpg",
    "screen_name": "_akhaliq",
    "statuses_count": 26684,
    "translator_type": "none",
    "url": "https://t.co/TbGnXZJwEc",
    "verified": true,
    "withheld_in_countries": [],
    "id_str": "2465283662"
  },
  "id": "1759798160011018397",
  "conversation_id": "1759798160011018397",
  "full_text": "Vision-Flan\n\nScaling Human-Labeled Tasks in Visual Instruction Tuning\n\nDespite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within the existing VLM frameworks: (1) lacking task diversity in pretraining and visual instruction tuning, and (2) annotation error and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, and each task is accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are firstly finetuned on Vision-Flan and further tuned on GPT-4 synthesized data. We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves the state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses to human-preferred formats; (2) A minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human-preference; (3) Visual instruction tuning mainly helps large-language models (LLMs) to understand visual features.",
  "reply_count": 2,
  "retweet_count": 42,
  "favorite_count": 139,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/GGwPF5gWoAAjfUu.png",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/_akhaliq/status/1759798160011018397",
  "created_at": "2024-02-20T04:32:31.000Z",
  "#sort_index": "1759798160011018397",
  "view_count": 19668,
  "quote_count": 1,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://twitter.com/_akhaliq/status/1759798160011018397"
}
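Since the section above is raw JSON, a small standard-library helper like the following can flatten the payload into the fields most readers care about. The input filename is an assumption; the dictionary keys match the keys shown in the response above.

import json


def summarize_tweet(raw: dict) -> dict:
    """Pull the commonly used fields out of a raw tweet payload."""
    return {
        "author": raw["user"]["screen_name"],
        "url": raw["url"],
        "posted_at": raw["created_at"],
        "views": raw.get("view_count"),
        "likes": raw.get("favorite_count"),
        "retweets": raw.get("retweet_count"),
        "replies": raw.get("reply_count"),
        "media_urls": [m["media_url"] for m in raw.get("media", [])],
        "truncated": raw.get("is_truncated", False),
    }


with open("raw_response.json") as f:  # assumed filename for the dump above
    print(summarize_tweet(json.load(f)))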