@DrJimFan
Let's reverse engineer GPT-4V's uncanny ability to convert screenshots/sketches to code. Believe it or not, it's actually a (relatively) easy training task, because synthetic data can be scaled up massively. No insider info, but this is how I'd do it:

1. Scrape lots of websites and their code. Use a lightweight model (GPT-3.5) to clean up the code, and Selenium to render screenshots. This becomes the initial training dataset of (image, code) pairs. (Sketch 1 below.)

2. Now, given a screenshot, ask the model to generate code and execute it in an actual browser. This step may throw errors, but GPT-4 is good at self-debugging: a few rounds of iterative refinement fix most obvious runtime errors. (Sketch 2 below.)

3. The code is runnable now, but the rendered website may not follow the input image completely. Enter a very powerful idea from agent learning called "Hindsight Relabeling" (https://t.co/f65BsP1tRk, published by OpenAI in 2017). The key insight: the "wrong" end product is actually correct for the code that produced it. So instead of keeping (Image1, code), render the generated code into Image2. The pair (Image2, code) is ground truth by construction and can be added back to the training dataset. (Sketch 3 below.)

4. Do aggressive data augmentation: change fonts, move HTML elements around, swap out backgrounds, add lots of noise. Combined with GPT-4V's extraordinary OCR abilities, it's conceivable that enough augmentation will help it generalize to hand-drawn sketches, such as the napkin demo that @gdb did in February. (Sketch 4 below.)

Video credit: @mckaywrigley
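Sketch 1 — the scrape-clean-render pipeline from step 1. This is just how I'd wire it up with the OpenAI Python client and Selenium; the model choice, cleanup prompt, and function names are my assumptions, not anything GPT-4V actually uses.

```python
from urllib.parse import quote
from openai import OpenAI
from selenium import webdriver

client = OpenAI()
driver = webdriver.Chrome()

def build_pair(url: str) -> tuple[bytes, str]:
    """Scrape one page and return an (image, code) training pair."""
    driver.get(url)
    raw_html = driver.page_source
    # A lightweight model strips trackers, inlined junk, dead CSS, etc.
    cleaned = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Clean up this HTML but keep the layout identical:\n"
                              + raw_html}],
    ).choices[0].message.content
    # Re-render the *cleaned* code so the screenshot matches the label exactly.
    driver.get("data:text/html;charset=utf-8," + quote(cleaned))
    screenshot = driver.get_screenshot_as_png()
    return screenshot, cleaned
```

Run this over a few million scraped URLs and you have the initial (image, code) corpus.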
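Sketch 2 — the self-debugging loop from step 2. I'm assuming Chrome's console log as the error channel (Selenium exposes it via get_log("browser") when logging is enabled); the prompts and round count are placeholders.

```python
from urllib.parse import quote
from openai import OpenAI
from selenium import webdriver

client = OpenAI()
opts = webdriver.ChromeOptions()
opts.set_capability("goog:loggingPrefs", {"browser": "ALL"})  # expose console errors
driver = webdriver.Chrome(options=opts)

def generate_and_debug(messages: list, max_rounds: int = 3) -> str:
    """messages starts as the (screenshot -> code) prompt; returns runnable code."""
    code = ""
    for _ in range(max_rounds):
        code = client.chat.completions.create(
            model="gpt-4", messages=messages,
        ).choices[0].message.content
        driver.get("data:text/html;charset=utf-8," + quote(code))
        errors = [e["message"] for e in driver.get_log("browser")
                  if e["level"] == "SEVERE"]
        if not errors:
            break  # no obvious runtime errors left
        # Hand the model its own code plus the errors and let it fix itself.
        messages = messages + [
            {"role": "assistant", "content": code},
            {"role": "user", "content": "The page threw these errors, fix the code:\n"
                                        + "\n".join(errors)},
        ]
    return code
```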
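Sketch 3 — hindsight relabeling from step 3. Nothing model-specific here: whatever the generated code renders to is, by definition, a correct label for that code.

```python
from urllib.parse import quote
from selenium import webdriver

driver = webdriver.Chrome()

def hindsight_relabel(dataset: list[tuple[bytes, str]], code: str) -> None:
    """The model missed Image1, but its output still yields a perfect pair."""
    driver.get("data:text/html;charset=utf-8," + quote(code))
    image2 = driver.get_screenshot_as_png()  # what the code *actually* looks like
    # (Image1, code) may disagree, but (Image2, code) is ground truth by construction.
    dataset.append((image2, code))
```

Same trick as the cited Hindsight Experience Replay paper: relabel the goal to whatever was actually achieved, and a "failed" trajectory becomes a valid training example.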
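Sketch 4 — augmentation from step 4, aimed at sketch-like inputs. The idea is to perturb the rendering while keeping the clean code as the label; the specific CSS perturbations here are purely illustrative.

```python
import random
from urllib.parse import quote
from selenium import webdriver

driver = webdriver.Chrome()

PERTURB_CSS = """
<style>
  * {{ font-family: '{font}' !important; }}
  body > * {{ transform: translate({dx}px, {dy}px) rotate({rot}deg); }}
  body {{ background: {bg} !important; }}
</style>
"""

def augmented_pair(code: str) -> tuple[bytes, str]:
    """Render a randomly perturbed version of the page; label stays the clean code."""
    css = PERTURB_CSS.format(
        font=random.choice(["Comic Sans MS", "cursive", "Courier New"]),
        dx=random.randint(-5, 5), dy=random.randint(-5, 5),
        rot=round(random.uniform(-2, 2), 2),
        bg=random.choice(["#fffbe6", "#eeeeee", "white"]),
    )
    driver.get("data:text/html;charset=utf-8," + quote(css + code))
    return driver.get_screenshot_as_png(), code
```

Stack enough of these (plus pixel-level noise on the screenshot itself), and hand-drawn-looking inputs start to look like just another point in the training distribution.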